perm filename TEX.ONE[1,DEK] blob
sn#332745 filedate 1978-02-09 generic text, type C, neo UTF8
Preliminary description of TEX D Knuth, July 15, 1977
In this memo I will try to explain the proposed TEX system for preparing
publishable documents. Some of its rules are still undergoing change, but
for the most part this memo defines the system being implemented, for the
benefit of the implementors. [Note: If you already have read the preliminary
version of this preliminary description, please forget everything that was
in that document and try to forget that it ever existed. Major changes have
occurred, based on the valuable feedback received after circulating that
document, so now let's move on to the real thing.]
TEX is for technical text. Insiders pronounce the X as a Greek Chi (cf. the
Scottish `ch' sound in `Loch Ness') since the word `technical' stems from a
Greek root meaning art as well as technology. I am preparing the system
primarily for use in publishing my series The Art of Computer Programming--
the initial system will be tuned for my books, but it will not be difficult to
extend it for other purposes if anybody wants to do so.
The input to TEX is a file in, say, TVeditor format at the Stanford AI lab.
The output is a sequence of pages, produced in "one pass," suitable for
printing on various devices. This report tries to explain how to get from
input to output. The main idea is to consider the process as an operation on
two-dimensional "boxes"; roughly speaking, the input is a long string of
characters culled from a variety of fonts, where each character may be
thought of as occupying a small rectangular box, and the output is obtained
by gluing these boxes together either horizontally or vertically with
various conventions about centering and justification, finally arriving at
big rectangular boxes which are the desired pages. While LISP works with
one-dimensional character strings, TEX works with two-dimensional box patterns;
TEX has both horizontal and vertical `cons' operations. Furthermore, TEX has
another important basic concept of elastic glue between boxes, a type of
mortar that stretches and shrinks at different specified rates so that box
patterns will fit together in flexible ways. (I should really use the word
"mortar" instead of "glue" throughout this document; the only trouble is
that the extra syllable makes mortar harder to pronounce, and it takes longer
to type the word besides. Maybe the user's manual will say "mortar" consistently;
the present document is emphatically NOT a user's manual.)
In order to explain TEX more fully, I will alternate between very low-level
descriptions of exactly how the processing takes place and very high-level
descriptions of what you type to get complex effects.
First, at the very lowest level, we must realize that the input to TEX is not
really a string of boxes; it is a file of 7-bit characters. This file is called
an "external input file". Seven of the visible printing characters will have
special uses in such files; throughout this memo I will use the symbols
\{}$⊗%# for them, but there will be a way to dedicate other symbols to these
purposes if desired. The seven basic delimiters are
\ the escape character used to indicate control mode rather than text mode
{ beginning of a group
} ending of a group
$ beginning and ending of math formulas
⊗ alignment tab
% beginning of comment
# macro parameter
The first thing TEX does is convert an external input file to an "internal
input file" by essentially obeying the following rules:
1. Delete the first TVeditor directory page, if it exists.
2. Replace the end-of-page marks ('14) on every remaining page by
carriage returns ('15). Delete all line-feed symbols ('12), null symbols ('00),
deletion codes ('177), and vertical tabs ('13). Replace all horizontal tabs ('11)
by spaces ('40). Delete all % marks and the sequences of characters following
them up to (but not including) the next carriage return.
3. Delete all blank spaces ('40) following carriage-returns.
4. If two or more carriage returns occur in sequence, replace all
of them by vertical-tab characters ('13). These are used to specify
end of paragraphs in TEX; in other words, the user specifies end of paragraph by
hitting two carriage returns in a row, or by end of page following a
carriage return.
5. Replace all remaining carriage-returns by blank spaces.
6. If two or more blank spaces occur in a row, replace them by a
single blank space.
7. Replace \ by '00, ⊗ by '11, $ by '12, { by '14, } by '15, # by '177
(assuming that these are the basic delimiters mentioned above).
8. Add infinitely many '15 symbols at the right.
The reason for rule 8 is that TEX uses { and } for grouping, and the trailing
'15's (which are equivalents of }'s) will match up with any {'s the user
erroneously forgot to match in the external input file. By following the
above rules, TEX obtains an internal input file containing no appearances
of the seven basic delimiters, and with no two blank spaces in a row. Spacing
in the output document is controlled by other features of TEX, and the seven
basic delimiters can be snuck in if necessary by using e.g. \ascii'173 for
the symbol {. [Special note:
Actually there are nine basic delimiters; the other two are ↑ and ↓, for
superscripts and subscripts respectively, but only within math formulas.
Due to the discrepancies between various vintages of ASCII codes, Stanford
codes are not universal; in particular, code '13 at MIT is the character ↑.
There is a way to specify ↑ as one of the nine basic delimiters, even at MIT,
and TEX will treat it properly -- not deleting it in rule 2 and not confusing it
with the character inserted in rule 4. TEX doesn't really apply rules 1-8 as
stated; it uses an efficient algorithm which has the net effect of these rules.]
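As an illustration only, rules 1-8 can be sketched in Python. This is not the efficient algorithm mentioned in the note; it simply applies the rules in order. The function names, the has_directory_page flag, and the choice to replace each run of carriage returns by a single '13 in rule 4 are assumptions made for this sketch, and the ↑/↓ delimiters of the special note are not handled.

```python
import itertools
import re

CR, FF, HT, SP, VT = "\r", "\x0c", "\t", " ", "\x0b"

# Rule 7: the basic delimiters become codes '00, '11, '12, '14, '15, '177.
DELIMITER_CODES = str.maketrans({
    "\\": "\x00", "⊗": "\x09", "$": "\x0a",
    "{": "\x0c", "}": "\x0d", "#": "\x7f",
})

def to_internal(text, has_directory_page=False):
    """Apply rules 1-7 to an external input file held in a string."""
    # Rule 1: drop the directory page (everything through the first '14).
    if has_directory_page and FF in text:
        text = text.partition(FF)[2]
    # Rule 2: page marks become CRs; LF, NUL, DEL, VT vanish; tabs become
    # spaces; comments are deleted up to (not including) the next CR.
    text = text.replace(FF, CR)
    text = re.sub(r"[\n\x00\x7f\x0b]", "", text)
    text = text.replace(HT, SP)
    text = re.sub(r"%[^\r]*", "", text)
    # Rule 3: delete blank spaces following carriage returns.
    text = re.sub(r"\r +", CR, text)
    # Rule 4: a run of two or more CRs marks the end of a paragraph
    # (assumption: the whole run becomes one '13).
    text = re.sub(r"\r{2,}", VT, text)
    # Rule 5: remaining CRs become blank spaces.
    text = text.replace(CR, SP)
    # Rule 6: collapse runs of blank spaces.
    text = re.sub(r" {2,}", SP, text)
    # Rule 7: encode the delimiters.
    return text.translate(DELIMITER_CODES)

def internal_stream(text, has_directory_page=False):
    """Rule 8: follow the converted text by an endless supply of '15 codes."""
    converted = to_internal(text, has_directory_page)
    return itertools.chain(converted, itertools.repeat("\x0d"))
```

Because rule 8 is modelled as a lazy iterator, a caller that reads past the end of the file keeps receiving '15 codes, which is what lets unmatched {'s be closed.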
Now let's shift to a high level and show how the user can specify complex
typesetting requirements to TEX. The following example is rather long and
it deserves to be skimmed rather than studied in detail; I devised it
mainly to serve as test data during initial development of the system. Don't
study it now, just glance at it and move to the next part of the memo.
(Note: I based the example on the opening pages of my book Seminumerical
Algorithms, but I skipped over lots of copy when the typesetting presented no
essentially new challenges to the system. Thus, the example concentrates on
difficult constructions, and it is by no means typical. The reader who
eventually does dig into its fine points might find it useful to have the
book in hand for comparison purposes.)
\require ACPhdr % 1
%Example TEX input related to Seminumerical Algorithms % 2
\ACPpages starting at page 1: % 3
\titlepage %This tells the page format routine not to put a page number on top % 4
\runninglefthead{RANDOM NUMBERS} % 5
\ljustline{\hexpand 11 pt {\:p CHAPTER \hskip 10 pt THREE}} % 6
\vskip 1 cm plus 30 pt minus 10 pt % 7
\rjustline{\:q RANDOM NUMBERS} % 8
\vskip .5 cm plus 1 pc minus 5 pt % 9
\quoteformat{Anyone who considers arithmetical \cr %10
methods of producing random digits \cr is, of course, %11
in a state of sin. \cr} author{JOHN VON NEUMANN (1951)} %12
\quoteformat{Round numbers are always false.\cr} %13
author {SAMUEL JOHNSON (c. 1750)} %14
\vskip 1 cm plus 1 pc minus 5 pt %15
\runningrighthead{INTRODUCTION} section{3.1} %16
\sectionbegin{3.1. %17
INTRODUCTION} %18
Numbers which are ``chosen at random'' are useful in a wide variety of %19
applications. For example: %20
% This blank line specifies end of paragraph %21
\yskip % This means a bit of extra space between paragraphs %22
\textindent{a)}{\sl Simulation.}\xskip When a computer is used to simulate %23
natural phenomena, random numbers are required to make things realistic. %24
Simulation covers many fields, from the study of nuclear physics (where %25
particles are subject to random collisions) to systems engineering (where %26
people come into, say, a bank at random intervals).\par %27
\yskip\textindent{b)}{\sl Sampling.}\xskip It is often impractical to examine %28
all cases, but a random sample will provide insight into what constitutes %29
``typical'' behavior. %30
%31
\yskip It is not easy to invent a foolproof random-number generator. This fact %32
was convincingly impressed upon the author several years ago, when he attempted %33
to create a fantastically good random-number generator using the following %34
peculiar method: %35
%36
\yskip\yskip\noindent{\bf Algorithm K}\xskip({\sl``Super-random'' number %37
generator.}).\xskip Given a 10-digit decimal number $X$, this algorithm may be %38
used to change $X$ to the number which should come next in a supposedly random %39
sequence.\par %40
\algstep K1. [Choose number of iterations.] Set $Y←\lfloor X/10↑9 \rfloor$, %41
i.e., the most significant digit of $X$. (We will execute steps K2 through K13 %42
$Y+1$ times; that is, we will randomize the digits a {\sl random} number of %43
times.)\par %44
\algstep K10. [99999 modify.] If $X<10↑5$, set $X←X↑2 + 99999$; %45
otherwise set $X←X-99999$.\xskip\blackslug %46
%47
\topinsert{\ctrline{\:r Table 1} %48
\ctrline{\:d A COLOSSAL COINCIDENCE: THE NUMBER 6065038420} %49
\ctrline{\:d IS TRANSFORMED INTO ITSELF BY ALGORITHM K.} %50
\vskip 3 pt \hrule %51
\ctrline{\valign{\vskip 6pt\top{#}⊗\vskip 6pt\top{#}\cr %52
\halign{\left{#}\quad⊗\ctr{#}⊗\left{#}\cr %53
Step⊗$X$ (after)\cr %54
\vskip 10 pt plus 10 pt minus 5 pt %55
K1⊗6065038420\cr K12⊗190586778⊗Y=5\cr} %end of \halign on line 53 %56
\vskip 10 pt plus 10 pt minus 5pt \cr %end of first column to be \valigned %57
\vrule %vertical rule between columns %58
\halign{\left{#}\quad⊗\ctr{#}⊗\left{#}\cr %59
Step⊗$X$ (after)\cr %60
\vskip 10 pt plus 10 pt minus 5 pt %61
K10⊗1620063735\cr %62
K11⊗1620063735\cr K12⊗6065038420⊗Y=0\cr}%end of \halign on line 59 %63
\vskip 10 pt plus 10 pt minus 5pt \cr}} %end of 2nd \valigned column,\ctrline %64
\hrule} %end of the \topinsert on line 48 %65
\yskip\yskip The moral to this story is that {\sl random numbers should not be %66
generated with a method chosen at random.} Some theory should be used. %67
%68
\exbegin %69
\tr\exno 1. [20] Suppose that you wish to obtain a decimal digit at random, not %70
using a computer. Shifting to exercise 16, let $f(x,y)$ be a function such that %71
if $0≤x,y<m$, then $0≤f(x,y)<m$. The sequence is constructed by selecting %72
$X↓0$ and $X↓1$ arbitrarily, and then letting $$ %73
X↓{n+1} = f(X↓n,X↓{n-1}) \qquad {\rm for} \qquad n>0.$$ %74
What is the maximum period conceivably attainable in this case? %75
%76
\exno 17. [10] Generalize the situation in the previous exercise so that %77
$X↓{n+1}$ depends on the preceding $k$ values of the sequence. %78
\par\vskip plus 100 cm\eject %79
\runningrighthead{GENERATING UNIFORM RANDOM NUMBERS} section{3.2} %80
\sectionbegin{3.2. GENERATING UNIFORM RANDOM NUMBERS} %81
In this section we shall consider methods for generating a sequence of random %82
fractions, i.e., random {\sl real numbers $U↓n$, uniformly distributed %83
between zero and one.} Since a computer can represent a real number with only %84
finite accuracy, we shall actually be generating integers $X↓n$ between %85
zero and some number $m$; the fraction$$U↓n = X↓n/m \eqno(1)$$ will %86
then lie between zero and one. %87
%88
\vskip.4in plus.2in minus.2in %89
\runningrighthead{THE LINEAR CONGRUENTIAL METHOD} section{3.2.1} %90
\sectionbegin{3.2.1. The Linear Congruential Method} %91
By far the most successful random number generators known today are special %92
cases of the following scheme, introduced by D. H. Lehmer in 1948. [See %93
{\sl Annals Harvard Comp. Lab.} {\bf 26}(1951), 141-146.] We choose four %94
``magic numbers'':$$ %95
\halign{\right{#}⊗\left{\quad\rm{#}\qquad}⊗\right{#}⊗\left{#}\cr %96
X↓0,⊗the starting value;⊗X↓0⊗≥0.\cr %97
m,⊗the modulus;⊗m⊗>X↓0,\quad m>a,\quad m>c.\cr}\eqno(1)$$ %98
The desired sequence of random numbers $\langle X↓n \rangle$ is then %99
obtained by setting$$X↓{n+1}=(aX↓n+c)\mod m,\qquad n≥0.\eqno(2)$$This is %100
called a {\sl linear congruential sequence.} %101
%102
Let $w$ be the computer's word size. The following program computes $(aX+c) %103
\mod(w+1)$ efficiently:$$\halign{{\it#}\qquad⊗\hjust to 25pt{\left{#}}⊗ %104
\left{\tt#}\cr %105
01⊗LDAN⊗X\cr %106
02⊗MUL⊗A\cr 05⊗JANN⊗*+3\cr %107
07⊗ADD⊗=W-1=\qquad\blackslug\cr}\eqno(2)$$ %108
{\sl Proof.}\xskip We have $x=1+qp↑e$ for some integer $q$ which is not a %109
multiple of $p$. By the binomial formula$$ %110
\eqalign{x↑p⊗=1+{p\choose 1}qp↑e+\cdots+{p\choose{p-1}}q↑{p-1} %111
p↑{(p-1)e}+q↑p p↑{pe}\cr %112
⊗=1+qp↑{e+1}\group(){1+1\over p{p\choose 2}qp↑e + 1\over p %113
{p\choose 3}q↑2 p↑{2e}+\cdots+1\over p{p\choose p}q↑{p-1} %114
p↑{(p-1)e}}.\cr}$$ By repeated application of Lemma P, we find that %115
\def\mlo#1{\ ({\rm modulo}\ #1)}$$\eqalign{(a↑p↑g - 1)/(a-1)⊗≡ 0 \mlo %116
{p↑g},\cr(a↑p↑g-1)/(a-1)⊗\neqv 0 \mlo{p↑{g+1}}.\cr}\eqno(6)$$ %117
If $1<k<p$, $p\choose k$ is divisible by $p$. \biglpren{\sl Note: }\xskip A %118
generalization of this result appears in exercise 3.2.2-11(a).\bigrpren\ By %119
Euler's theorem (exercise 1.2.4-48), $a↑{\varphi(p↑{e-f})}≡ 1 \mlo %120
{p↑{e-f}}$; hence $λ$ is a divisor of$$ %121
λ(p↓1↑{e↓1} \ldots p↓t↑{e↓t}) = {\rm lcm}\group() %122
{λ(p↓1↑{e↓1}),\ldots,λ(p↓t↑{e↓t})}.\eqno(9)$$ %123
%124
This algorithm in \MIX\ is simply$$ %125
\halign{\hjust to 25pt{\left{\tt#}}⊗\left{\tt#\qquad}⊗\left{#}\cr %126
J6NN⊗*+2⊗\underline{{\it A1. }j<0?}\cr %127
STA⊗Z⊗\qquad\qquad→Z.\cr}\eqno(7)$$ %128
That was on page 26. If we skip to page 49, $Y↓1 +\cdots+ Y↓k$ will %129
equal $n$ with probability$$ %130
\sum↓{{y↓1+\cdots+y↓k=n}\atop{y↓1,\ldots,y↓k≥0}} %131
\prod↓{1≤s≤k} %132
{e↑{-np↓s}\group(){np↓s}↑{y↓s}}\over{y↓s!} %133
={e↑{-n}n↑n}\over{n!}.$$ %134
This is not hard to express in terms of $n$-dimensional integrals,$$ %135
{\int↓{α↓n}↑n dY↓n \int↓{α↓{n-1}}↑{Y↓n} %136
dY↓{n-1}\ldots\int↓{α↓1}↑{Y↓2}dY↓1} \over %137
{\int↓0↑n dY↓n \int↓0↑{Y↓n} dY↓{n-1}\ldots %138
\int↓0↑{Y↓2}dY↓1},\qquad{\rm where}\qquad α↓j= %139
\max(j-t,0).\eqno(24)$$ %140
This together with (25) implies that $$\def\rtn{\sqrt n} %141
\mathop{lim}↓{n→∞} s \over \rtn %142
\sum↓{\rtn s<k≤n}{n\choose k}\group(){k\over n - s\over\rtn}↑k %143
\group(){s\over\rtn + 1 - k \over n}↑{n-k-1} = e↑{-2s}↑2,\qquad s≥0, %144
\eqno(27)$$ a limiting relationship which appears to be quite difficult to %145
prove directly. %146
%147
\exbegin\exno 17. [HM26] Let $t$ be a fixed real number. For $0≤k≤n$, let$$ %148
P↓{nk}(x)=\int↑x↓{n-t}dx↓n\int↓{n-1-t}↑{x↓n}dx↓{n-1} %149
\ldots\int↓{k+1-t}↑{x↓{k+2}}dx↓{k+1}\int↓0↑{x↓{k+1}} %150
dx↓k\ldots\int↓0↑{x↓2}dx↓1;$$ %151
Eq. (24) is equal to$$\def\sumk{\sum↓{1≤k≤n}} %152
\sumk X\prime↓k Y\prime↓k \group/.{\sqrt{\sumk{X\prime↓k}↑2} %153
\sqrt{\sumk{Y\prime↓k}↑2}}.$$\par %154
\runningrighthead{A BIG MATRIX DISPLAY} section{3.3.3.3} %155
\subsectionbegin{3.3.3.3. This subsection doesn't exist}Finally, look at page %156
91.$$\def\diagdots{\raise 10pt .\hskip 4pt \raise 5pt .\hskip 4pt .} %157
\eqalign{U⊗=\group(){\halign{\ctr{#}⊗\ctr{#}⊗\ctr{#}⊗\ctr{#}⊗\ctr{#}⊗\ctr{#}⊗ %158
\ctr{#}\cr 1\cr ⊗\diagdots\cr ⊗⊗1\cr %159
c↓1⊗\ldots⊗c↓{k-1}⊗1⊗c↓{k+1}⊗c↓n\cr %160
⊗⊗⊗⊗1\cr⊗⊗⊗⊗⊗\diagdots\cr⊗⊗⊗⊗⊗⊗1\cr}},\cr %161
U↑{-1}⊗=\group(){\halign{\ctr{#}⊗\ctr{#}⊗\ctr{#}⊗\ctr{#}⊗\ctr{#}⊗\ctr{#}⊗ %162
\ctr{#}\cr 1\cr ⊗\diagdots\cr ⊗⊗1\cr %163
-c↓1⊗\ldots⊗-c↓{k-1}⊗1⊗-c↓{k+1}⊗-c↓n\cr %164
⊗⊗⊗⊗1\cr⊗⊗⊗⊗⊗\diagdots\cr⊗⊗⊗⊗⊗⊗1\cr}}}$$ %165
This ends the test data, fortunately TEX is working fine. %166
The first thing that must be emphasized about this example is that it is much more
complicated than ordinary TEX input, for reasons stated above. The second thing
that should be emphasized is that it is written in an extension of TEX,
not in the basic language itself. For example, "\ACPpages" in line 3 is a
special routine that prepares pages in the format of The Art of Computer
Programming. The codewords \ACPpages, \titlepage, \runninglefthead,
\runningrighthead, \quoteformat, \author, \sectionbegin, \xskip, \yskip,
\textindent, \algstep, \exbegin, \exno, and \subsectionbegin are specific to
my books and they have been defined in terms of lower-level TEX primitives
as we shall see below. Furthermore most of the fonts used are embedded in these
hidden definitions; for example, "\sectionbegin" defines the 10 point fonts used
in the text of a section, while "\exbegin" sets up for 9 point type which is used
in the exercises. Another definition is that the word MIX is usually to be set
in the ``typewriter type'' font; the hidden definition
\def\MIX{{\tt MIX}}
causes this to happen automatically in line 125. We shall study TEX's macro
definition mechanism later; three simple examples appear in lines 141, 152,
and 157 of the sample program, where a common idiom was given a short
definition just to save space. For curious people who want to see a more
complicated definition, here is the way \quoteformat is defined:
\def\quoteformat#1 author#2{\lineskip 11 pt plus .5 pt minus 1 pt
\vskip 6 pt plus 2 pt minus 2 pt
\def\rm{\:s} \def\sl{\:t}
{\sl \halign{\right{##}\cr#1}}
\vskip 6 pt plus 2 pt minus 2 pt
\rm \rjustline{#2}
\vskip 11 pt plus 4 pt minus 2 pt}
The word "author" which appears in this definition is not preceded by the
escape character \ since it is scanned as part of the \quoteformat macro.
Please don't expect to understand this mess now. TEX is really very simple;
trust me.
In fact, let's forget all the complications for a moment and try to imagine
TEX at its simplest. Consider the following alternative to the above examples:
A file containing no occurrences of the symbol "\" is preceded by the code
"\deffnt a METS".
Then TEX will output this file in the METS font, with all paragraphs justified.
The very first nonblank character in the external input file is taken by TEX
to be the user's escape character. The user thinks of this character pretty
much as he or she thinks of the "control" key in the editor, since it precedes
system instructions. Normally the file will start with
\require FILENAME
where FILENAME sets up the user's favorite default values; this has been done
in line 1 of our big example. File ACPhdr begins with the sequence
\chcode'173←2. \chcode'176←3. \chcode'44←4.
\chcode'26←5. \chcode'45←6. \chcode'43←7.
\chcode'136←8. \chcode'1←9.
which, at Stanford, defines the characters {}$⊗%#↑↓ to be the basic delimiters
2,3,4,5,6,7,8, and 9, respectively; but most users won't ever deal with such
low-level trivia since they will be using somebody else's \require file. The
ACPhdr file also defines codewords like \quoteformat and other standard
Art of Computer Programming conventions.
Now let's penetrate past line 1 of the example and see if we can figure out
any more. The beginning of a chapter is generally complex from a typographic
standpoint, and lines 3-18 of the example are devoted to getting through these
initial complications; the chapter really starts at line 19. Let us now
muster up enough courage to tackle lines 3-18.
The specification of \runninglefthead in line 5 gives the copy that is to appear
in the top line of left-hand pages in the book. Line 6 contains the first actual
text to appear on the page; but look first at line 8, which is simpler:
Line 8 says to use font q (which ACPhdr has defined to be "Computer Modern
Gothic Bold 20 point", a font that I am currently designing) for the words
RANDOM NUMBERS, and to right-justify them on the line (\rjustline).
Line 6 is similar but it uses a different font, font p (which turns out to
be Computer Modern Gothic 11 point type); a TEX user can have up to 32 fonts,
named @, A or a, B or b, ..., Z or z, [, | or <, ], ↑, and ← respectively (cf. ASCII
code; any character can be used, and only its low five bits are relevant).
Font @ is special: it is the only font whose characters are allowed to have
different heights and baselines; the other fonts will define constant baseline
and box height for each of their characters. Furthermore, TEX assumes that some
of its math symbols are on font @.
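The font-naming convention amounts to masking a character code down to five bits. A minimal Python sketch follows; the function name font_number is hypothetical, and ASCII ^ and _ stand in for the Stanford characters ↑ and ←, on the assumption that they occupy codes '136 and '137.

```python
def font_number(name):
    """Return the font slot (0..31) selected by a one-character font name.

    Only the low five bits of the character code matter, so 'p' and 'P'
    name the same font.
    """
    if len(name) != 1:
        raise ValueError("a font name is a single character")
    return ord(name) & 0o37

# The 32 slots in order: @, A..Z, [, | (or <), ], then ^ and _ standing in
# for the Stanford characters at codes '136 and '137.
```

For example, font_number("p") and font_number("P") both give 16, and "|" and "<" both give 28.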
Continuing in line 6, "\hskip 10 pt" stands for 10 points of horizontal space, and
"\hexpand 11 pt" means to take the line and stretch it 11 points wider than
it would ordinarily be. Note that font definitions within {...} disappear
outside the braces. So do all other definitions.
It is time to explain TEX's mechanism for stretching and shrinking.
Sizes in TEX can be specified in terms of points, picas, inches, centimeters,
or millimeters, using the abbreviations pt, pc, in, cm, mm respectively;
these abbreviations are meaningful only in contexts where physical lengths
are used, and one of these units must always be supplied in such contexts
even when the length is 0. (One inch equals 6 picas equals 72 points equals
2.54 centimeters equals 25.4 millimeters.) The glue between boxes has three
components:
the fixed component, x
the plus component, y
the minus component, z
The x part is "normal" spacing which is used when boxes are strung together
without modification. When expanding a sequence of boxes to more
than their normal length, as on line 6, each x part in the glue is increased
by some multiple of the y part. When shrinking, each x part is decreased by
some multiple of the corresponding z part.
For example, given four boxes bound together by glue of specifications
(x1,y1,z1), (x2,y2,z2), (x3,y3,z3),
expansion of the line by an amount w is achieved by using the spacing
x1 + y1 w', x2 + y2 w', x3 + y3 w',
where w' = w/(y1+y2+y3). The expansion is impossible if w>0 and y1+y2+y3=0.
The system tries hard to minimize bad-looking expansions, by having a
reasonably intelligent justification routine (described below). When
shrinking a line, the maximum amount of contraction is the sum of the z's; no
tighter fit will be attempted. Thus, one should define the z width so that
x-z is the minimum space tolerated. The proper setting of y is such that x+y
is a reasonable spacing but x+3y is about at the threshold of intolerability.
Parameters y and z must be nonnegative, but x may be negative (for backspacing
and overlap) if care is used.
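The expansion formula above, together with the shrinking limit, can be written out as a short Python sketch. The name set_glue is hypothetical, and the proportional shrinking rule (each x reduced by z times w/(z1+z2+...)) is an assumption made by analogy with the stated expansion rule; the memo only says that each x is decreased by some multiple of its z.

```python
from fractions import Fraction

def set_glue(glue, w):
    """Return the final spacings for a list of (x, y, z) glue triples.

    w > 0 expands the line by w; w < 0 shrinks it by -w.
    """
    if w >= 0:
        total_y = sum(y for _, y, _ in glue)
        if w > 0 and total_y == 0:
            raise ValueError("expansion impossible: y1+y2+... = 0")
        ratio = Fraction(w) / total_y if total_y else 0
        return [x + y * ratio for x, y, _ in glue]
    total_z = sum(z for _, _, z in glue)
    if -w > total_z:
        raise ValueError("no tighter fit: contraction exceeds sum of z's")
    ratio = Fraction(-w) / total_z
    return [x - z * ratio for x, _, z in glue]
```

With glue (6,2,1), (6,2,1), (6,4,2), expanding by 4 gives w' = 4/8 and spacings 7, 7, 8; shrinking by 2 gives 5.5, 5.5, 5.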
The notation "\hskip 10 pt" in line 6 means that x = 10 pt, y = 0, and z = 0
in the horizontal glue between CHAPTER and THREE. If this \hskip hadn't been
used, the normal conventions would have applied. Namely, the glue between
CHAPTER and THREE would have had x=w (the normal interword spacing for the font),
and y = w/8, z = w/2. The glue between letters has (say) x = 0, y = w/18,
and z = w/6; thus the expansion without using \hskip would have
gone mostly into the space between CHAPTER and THREE, but by using
\hskip as shown the expansion spreads out the individual letters. Fonts to
be used with TEX will have such letter spacing indicated as an additional feature;
TEX will also recognize things like the end of sentences, giving slightly more
white space after punctuation marks where appropriate. Symbols like + and =
conventionally are surrounded by spaces, while symbol pairs like '' and := are
moved closer together; the symbols + and = are not surrounded by spaces when
they appear in subscripts. Such things are taken care of in the same way that
ligatures are handled, by making the fonts a little smarter under TEX's control,
as described in more detail later.
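To make the arithmetic of the CHAPTER THREE example concrete, here is a small Python sketch. It assumes that letter glue sits only between adjacent letters within a word, uses the "(say)" values quoted above, and takes the 11 points of \hexpand as the amount to distribute. The helper name stretch_shares is hypothetical, and nothing here is taken from TEX's actual code.

```python
from fractions import Fraction

def stretch_shares(stretches, extra):
    """Split `extra` among glue items in proportion to their y components."""
    total = sum(stretches)
    return [extra * s / total for s in stretches]

w = Fraction(1)            # interword space of the font; it cancels out
letter_y = w / 18          # y of the glue between letters
word_y = w / 8             # y of the normal interword glue
extra = Fraction(11)       # \hexpand 11 pt

# Gaps in "CHAPTER THREE": 6 letter gaps, 1 word gap, 4 letter gaps.
default = [letter_y] * 6 + [word_y] + [letter_y] * 4
with_hskip = [letter_y] * 6 + [Fraction(0)] + [letter_y] * 4  # \hskip 10 pt: y = 0

default_shares = stretch_shares(default, extra)
hskip_shares = stretch_shares(with_hskip, extra)
```

Under these assumptions the default interword glue stretches 2.25 times as much as any single letter gap, whereas with \hskip 10 pt the word gap stays fixed and each of the ten letter gaps opens by 1.1 pt.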
Much of TEX is symmetrical between horizontal and vertical, although the basic
idea of building lines first rather than columns first is asymmetrically built
in to the standard routines (because of the way we normally write). The
specification "\vskip" on line 7 specifies vertical glue analogous to
horizontal glue. When the page justification routine tries to fill the first
page of the example text (by breaking between pages ideally at the end of a
paragraph, or in a place where at least two lines of a paragraph appear on
both pages), this glue will expand or shrink in the vertical dimension using
x = 1 centimeter, y = 30 points, z = 10 points. Further variable glue is
specified in the definition of \quoteformat: \lineskip is the inter-line spacing,
and the \vskips give special additional spacing between quotation and author
lines. In the main text, the printer would refer to the type as "10 on 12",
meaning 10 point type with 12 points between corresponding parts of adjacent
lines; TEX will use a 10 point font with
\lineskip 12 pt plus .25 pt minus .25 pt
so that its line spacing is somewhat more flexible. Additional spacing between
paragraphs is given by
\parskip 0 pt plus 1 pt minus 0 pt
and there is other spacing between exercises, steps in algorithms, etc. The
definition
\def\yskip {\vskip 3 pt plus 2 pt minus 2 pt}
is used for special vertical spacing, for example in lines 22 and 37. A horizontal
skip
\def\xskip {\hskip 6 pt plus 3.5 pt minus 3.5 pt}
is used in lines 23, 37, etc. in contexts where a larger than normal space is
psychologically acceptable; for such purposes, flexible glue is especially
useful. Larger horizontal spaces called "\quad" and "\qquad" are used for example in
line 96; these are "one em" and "two ems" of space, respectively, units frequently
used in mathematical typesetting.
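Glue specifications such as "1 cm plus 30 pt minus 10 pt" can be reduced to (x, y, z) triples in points using the unit equivalences given earlier. The following Python sketch is illustrative only: parse_glue and POINTS_PER are hypothetical names, the fixed part is required, and omitted plus or minus parts are taken to be zero.

```python
import re
from decimal import Decimal
from fractions import Fraction

# One inch = 6 picas = 72 points = 2.54 cm = 25.4 mm (as stated in the memo).
POINTS_PER = {
    "pt": Fraction(1),
    "pc": Fraction(12),
    "in": Fraction(72),
    "cm": Fraction(72) / Fraction("2.54"),
    "mm": Fraction(72) / Fraction("25.4"),
}

_LEN = r"(-?(?:\d+(?:\.\d*)?|\.\d+))\s*(pt|pc|in|cm|mm)"
_SPEC = re.compile(rf"\s*{_LEN}(?:\s*plus\s*{_LEN})?(?:\s*minus\s*{_LEN})?\s*$")

def parse_glue(spec):
    """Return (x, y, z) in points for a spec like '1 cm plus 30 pt minus 10 pt'."""
    m = _SPEC.match(spec)
    if m is None:
        raise ValueError(f"not a glue specification: {spec!r}")

    def points(number, unit):
        if number is None:          # omitted plus/minus part
            return Fraction(0)
        return Fraction(Decimal(number)) * POINTS_PER[unit]

    g = m.groups()
    return points(g[0], g[1]), points(g[2], g[3]), points(g[4], g[5])
```

For line 7 of the example, parse_glue("1 cm plus 30 pt minus 10 pt") gives x = 3600/127 points (about 28.35), y = 30, and z = 10.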
The generality of flexible glue can be appreciated when you consider the
hypothetical definition
\def\hfill {\hskip 0 cm plus 1000 cm minus 0 cm}.
In this case, y is essentially infinite (10 meters long). When such an \hfill code
appears at the beginning of a line, it right justifies that line; when
it appears at both the beginning and end, it centers the line; when it appears
in the middle it neatly partitions the line; and so on. These aspects of
TEX's variable glue seem to make it superior to existing justification
systems, because it provides new features in a rather simple way and at the
same time reduces the old features to a common pattern.
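The effect of the hypothetical \hfill can be checked numerically. In this Python sketch the line is modelled as a list of stretch (y) values, the slack is divided in proportion to them, and the two ordinary interword glues with y = 1 pt are made-up figures for illustration; distribute is a hypothetical helper.

```python
HFILL_STRETCH = 1000 * 72 / 2.54   # the y part of \hfill, 1000 cm in points

def distribute(stretches, slack):
    """Divide `slack` points among glue items in proportion to their y parts."""
    total = sum(stretches)
    return [slack * s / total for s in stretches]

slack = 200.0   # points by which the line must grow (illustrative figure)

# \hfill at the beginning, then two ordinary interword glues.
right_justified = distribute([HFILL_STRETCH, 1.0, 1.0], slack)

# \hfill at both the beginning and the end.
centered = distribute([HFILL_STRETCH, 1.0, 1.0, HFILL_STRETCH], slack)
```

In the first case the leading \hfill absorbs all but about 0.014 pt of the slack, so the text sits against the right margin; in the second the two \hfills take equal amounts, so the text is centered.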
Once a list of boxes has been justified, all the glue is permanently set and the
overall box becomes rigid. Justification is either horizontal or vertical.
Horizontal justification of long lines includes some automatic hyphenation;
further details about justification appear later in this memo.
Now let's look at more of the example. Lines 21 and 31, etc., are blank lines
indicating the end of a paragraph; another way to specify this is ``\par'' in
line 27. Line 124 is an end of a paragraph that ends with a displayed formula.
Paragraphs are normally indented; one of the TEX commands subsumed under
"\sectionbegin" in line 17 is
\parindent 20 pt
which affects the first line of every paragraph unless "\noindent" appears
as on line 37. The "\sectionbegin" routine also specifies "\noindent" on the
very first paragraph of the section, since this is the standard format in
my books. On line 23 we have "\textindent{a)}", which creates a box of
width \parindent containing the characters "a)" followed by a blank space, right
justified in this box.
In line 23 the "\sl" means "use slanted font." I have tentatively decided to
replace italics by a slanted version of the normal "Roman" typeface, for
emphasized words in the text, while variable symbols in math formulas will
still be in italics as usual. I will have to experiment with this, but my
guess is that it will look a lot better, since typical italic fonts do not
tie together into words very beautifully. At any rate I will be distinguishing
slanted from italic, in case it becomes desirable to use two different fonts
for these different purposes. The "\bf" in line 37 stands for boldface. All
these fonts are defined in \sectionbegin, whose definition is hidden from you.
The periods in lines 84 and 101 are ``slanted''; this places them properly
close to the preceding letter, since a little space usually will intervene
when slant mode goes off.
Mathematical formulas all appear between $...$ pairs, cf. lines 38 and 41,
or between $$...$$ pairs (displayed equations). A special syntax is used in
formulas, modeled on that of Kernighan and Cherry, Comm. ACM 18 (March 1975),
151-157. For example, "↑9" in line 41 specifies a superscript 9, "↓{n+1}"
in line 74 specifies a subscript n+1. Math-structure operators like
↑ and ↓ take action only within $'s. All letters in formulas
are set in italics unless they form part of a recognized word or are
surrounded by "{\rm ...}" or "\hjust...", etc. Digits and
punctuation marks like semicolons or parentheses, etc., come out in roman
type in math mode unless specified {\it...} as in line 127. The "1" on that
line will be italic, as will the "j", but the "0" and "?" will be roman.
Spacing within formulas is chosen by TEX, independent of spaces actually
typed, although it is possible to insert space in cases when
TEX's rules are unsatisfactory. In line 116, for example, extra space
has been specified before and after "(modulo" using the code "\ ";
space before parentheses is usually omitted, but it should not be omitted
here. The later parts of the example text are largely concerned with
complicated formulas, which will appear as shown in the corresponding parts
of volume 2. The code "\eqno(24)" (cf. line 140) will insert "(24)" at the right
margin, vertically centered on the corresponding displayed formula, if there
is room; otherwise an attempt is made to move the formula left off-center to
fit in "(24)"; otherwise the "(24)" is stuck on a new line at the bottom right.
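The three-way choice for equation numbers can be sketched as follows. The geometry is an assumption, since the memo gives no exact test for "room": here a centered formula has room if its right edge stays clear of the number, and min_gap is a hypothetical parameter for the minimum separation.

```python
def place_eqno(line_width, formula_width, number_width, min_gap=0.0):
    """Return (placement, formula_left_edge) for a displayed formula.

    placement is "centered", "shifted" (formula moved left off-center),
    or "own line" (number put on a new line at the bottom right).
    """
    centered_left = (line_width - formula_width) / 2
    limit = line_width - number_width - min_gap   # rightmost formula edge allowed
    if centered_left + formula_width <= limit:
        return ("centered", centered_left)
    if formula_width <= limit:
        return ("shifted", limit - formula_width)
    return ("own line", centered_left)
```

On a 300-point line with a 20-point number, a 100-point formula stays centered, a 270-point formula is shifted left to start at 10 points, and a 290-point formula forces the number onto its own line.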
The algorithm-step-format keyword "\algstep" used on lines 41 and 45 is defined
as follows:
\def\algstep #1. [#2] {\vskip 3 pt plus 1 pt minus 1 pt
\noindent \hjust to 20 pt{\right{#1.}} [#2]\xskip
\hangindent 20 pt}
This sets vertical spacing glue before the text for the algorithm step, and it
also sets up an appropriate "textindent", etc. The new feature here is the
hanging indent, which affects all but the first line of the following
paragraph.
The keyword "\exno" used on lines 70, 77, etc. has a definition somewhat
similar to \algstep; such definitions of format that are needed frequently in
my books will help to ensure consistency. The "\tr" in line 70 will insert
a triangle in the left margin, using negative glue spacing so that the
character actually appears to the left of the box it is in.
Line 48 begins a "\topinsert", one of the important features needed in page layout.
The box defined within the {...} after \topinsert is set aside and placed on top
of either the present page or the following page (followed by vertical glue
specified by
\topskip 20 pt plus 10 pt minus 5 pt,
this being another thing subsumed by "\sectionbegin"). Box inserts are used for
figures and tables, in this case a table. The table caption appears in lines 48-50;
the table itself (cf. page 6 of the book) is rather complicated, so we will
defer explanation of lines 52-65 until after we have studied the simpler example
of \halign in lines 96-98.
In general, an \halign command takes the form
\halign{ u1#v1 ⊗ ... ⊗ un#vn \cr
x11 ⊗ ... ⊗ x1n \cr
. . . . . . . .
xm1 ⊗ ... ⊗ xmn \cr}
(In addition, \vskip's, \hrule's, and displayed-equation-mode \eqno's are allowed
after the \cr's.) The "\cr" is not a carriage-return; it is the sequence of three
characters \, c, r. The u's and v's are any sequences of characters not including
#, ⊗, or \cr. The meaning is to form the mn horizontal lists ("hlists")
of boxes
u1{x11}v1 ... un{x1n}vn
. . . . . . . . .
u1{xm1}v1 ... un{xmn}vn
and, for each k, to determine the maximum width of hlist uk{xik}vk for i = 1,...,m.
Then each uk{xik}vk is hjustified to the size of this maximum width, and each
line xi1⊗...⊗xin\cr is replaced by the horizontal concatenation of the resulting
boxes, separated by horizontal glue specified by
\htabskip 0 pt plus 1 pt minus 0 pt.
If fewer than n entries appear on any line before the \cr, the remaining entries
are left blank. When the \halign appears inside $'s, each of the individual
uk{xik}vk hlists is considered to be a separate independent formula.
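The column-width rule can be sketched outside TEX. The following Python fragment
is only an illustration under simplifying assumptions that are not part of this
memo: every box is a plain string whose width is its character count, each
entry is padded on the right instead of being justified by \left, \ctr, or
\right, a single space stands in for the \htabskip glue, and a missing entry is
treated as an empty box. The name halign_sketch is hypothetical.

```python
def halign_sketch(template, rows, tabskip=" "):
    """template is a list of (u, v) string pairs, one per column;
    rows is a non-empty list of lists of entry strings (each of length <= n)."""
    n = len(template)
    cells = []
    for row in rows:
        # Entries missing before the \cr are left blank.
        padded = list(row) + [None] * (n - len(row))
        cells.append([
            "" if x is None else u + x + v
            for (u, v), x in zip(template, padded)
        ])
    # For each column k, the maximum width over all rows i.
    widths = [max(line[k] and len(line[k]) or 0 for line in cells) for k in range(n)]
    # Every cell is set to its column's width; columns are joined by the tabskip.
    return [
        tabskip.join(line[k].ljust(widths[k]) for k in range(n))
        for line in cells
    ]
```

All output lines come out with the same total length, because each column is
widened to its widest entry before the lines are assembled.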
In the example of \halign on lines 96-98 we have n=4; the first column is to
be right justified, the second is to be treated as "\rm" and surrounded by
quad spaces, then placed flush left in its column,
the third again is right justified, the fourth is simply left-justified.
The result is shown on page 9 of the book, although with TEX the formula number
"(1)" will be centered. Note: Eventually I will put in an "\omittab" feature which
will allow portions of a line to span several columns when desired.
Now let's look at lines 52-65 and compare with Table 1 on page 6 of the book.
Two boxes are built up using \halign and its vertical dual \valign.
The "\eqalign" feature (cf. lines 111, 116) is used to line up operators in
different displayed formulas. Actually this is simply a special case of
\halign:
\def\eqalign #1{\halign{\right{##}⊗\left{##}\cr#1}}.
Note that line 113 begins with ⊗.
The "\group(){...}" in lines 113, 143, etc. and the "\group/.{...}" in line 152
are used to choose suitably large versions of parentheses and slashes, forming
"(...)" and "/...", respectively; the size of the symbols is determined
by the height of the enclosed formula box. This type of operation is available
for [], <>, ||, left and right braces or floor/ceiling brackets. TEX will
provide the best size available in its repertoire. Parentheses, brackets,
braces, and vertical lines will be made in arbitrarily large sizes, but slashes
will not, at least not in this year's TEX. Some very large parentheses will be
generated for the big matrices in lines 158ff.
The "\biglpren" and "\bigrpren" on lines 118-119 are not really so big,
but they are larger than the normal ones. The \group() operation will use
these in lines 122-123.
The summation signs produced by "\sum ..." in lines 131, 143, will be
large, since these summations occur in displayed formulas; but between $...$
a smaller sign will be used and the quantity summed over will be attached
as a subscript. Similarly, the size of an integral sign will change, and
fractions "...\over..." do too, as does the binomial coefficient
(cf. "$p \choose k$" in line 118). More about this later.
The \eject in line 79 means to eject a page before continuing.
This again is part of the format of my books: a major section should always
begin on a new page.
I think the above comments on the example give the flavor of TEX. The example
was intended to show a variety of challenging constructions of unusual complexity;
in general, most of a book will be comparatively routine, and it is only the
occasional formula or table which proves intricate. Such intricacies were
purposely emphasized in the example.
The next step in understanding TEX is to go back to the low level and see
what happens to an internal input file as it is being read in. What happens is
that it is massaged once again, converted to a sequence of "tokens", which are
either single characters or "control sequences" which stimulate TEX to do some
work. A control sequence is either \ followed by a single nonletter nondelimiter,
or \ followed by one or more letters (and terminated by the first nonletter).
For example, "\vsize" and "\]" are control sequences; the font-change action
"\:a" is two tokens, the control sequence \: followed by the character a;
the string "\ascii'147" is five tokens, the control sequence \ascii followed
by ',1,4,7. If the character following \ is a letter, and if the control sequence
is terminated by a blank space, this blank space is ignored, effectively removed
from the input file -- its purpose was simply to mark the end of the control
sequence. Thus, for example, "\yskip \yskip" and "\yskip\yskip" are equivalent.
I had to write "\MIX\ " instead of simply "\MIX " on line 125, in order to
obtain the two tokens \MIX and "\ ", the latter space now counting as a real one.
For appearance's sake, TEX also ignores a space following a font-identification
character; e.g. "\:aNow is the time" is equivalent to "\:a Now is the time".
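As a rough illustration of these tokenizing rules, here is a small Python
sketch. It is not TEX's scanner: it assumes that "letter" means whatever
str.isalpha accepts, it ignores the basic delimiter codes and the folding of
upper and lower case, and the name tokenize is hypothetical.

```python
def tokenize(text):
    """Split text into single characters and control sequences."""
    tokens, i = [], 0
    while i < len(text):
        if text[i] != "\\":
            tokens.append(text[i])          # an ordinary character
            i += 1
            continue
        j = i + 1
        if j < len(text) and text[j].isalpha():
            while j < len(text) and text[j].isalpha():
                j += 1                      # \ followed by one or more letters
            tokens.append(text[i:j])
            if j < len(text) and text[j] == " ":
                j += 1                      # the delimiting blank is dropped
            i = j
        else:
            tokens.append(text[i:j + 1])    # \ followed by a single nonletter
            i = j + 1
            if tokens[-1] == "\\:" and i < len(text):
                tokens.append(text[i])      # the font-identification character
                i += 1
                if i < len(text) and text[i] == " ":
                    i += 1                  # a blank after it is also dropped
    return tokens
```

With this sketch, "\ascii'147" yields five tokens and "\MIX\ " yields the two
tokens \MIX and "\ ", in agreement with the examples in the text.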
When the control sequence consists of \ and a single character, all printing
characters are distinguished, but when the control sequence consists of \ and a
letter string no distinction is made between upper and lower case letters,
except on the very first letter of the sequence; thus, "\GAMMA" and "\Gamma" are
considered identical, but they are not the same as "\gamma". Furthermore,
letter sequences are considered different only if they differ in the
first seven characters (six if TEX is implemented on a 32-bit machine) or if
they have different lengths mod 8. For example, "\qUotefOmmmm" and
"\quOTEfoxxxxxxxxxxxx" are both equivalent to "\quoteformat". The total number of
different control sequences is therefore approximately
128-14 + 2*(26↑2 + 26↑3 + 26↑4 + 26↑5 + 26↑6 + 8*26↑7)
and this should be enough.
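The equivalence rule for letter-string control sequences amounts to comparing
a short key. The Python sketch below is one way to express it; the function
names and the default of seven significant characters are illustrative
assumptions, and the count simply transcribes the formula above.

```python
def control_key(name, significant=7):
    """name is the letter string that follows the backslash.
    Two names are treated as the same control sequence when their keys agree."""
    # Case matters only on the very first letter.
    head = name[0] + name[1:significant].lower()
    # Beyond the significant prefix, only the length mod 8 is remembered.
    return (head, len(name) % 8)

def approximate_count():
    """The estimate from the text: single-character sequences plus letter strings."""
    return 128 - 14 + 2 * (sum(26 ** k for k in range(2, 7)) + 8 * 26 ** 7)
```

For instance, "quoteformat" has 11 letters and the second variant in the text
has 19, so both lengths are 3 mod 8 and the keys agree.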
A control sequence is, of course, invalid unless TEX knows its meaning.
TEX knows certain primitive control sequences like "\vsize" and "\ " and "\def",
and the macro facility provided by \def enables it to learn (and forget) other
control sequences like "\MIX" and "\quoteformat".
Here are the precise rules by which TEX reduces the internal input file
to ``pure input'' consisting of tokens in which every control sequence is primitive
and distinct from "\def" and "\require". We already mentioned that \require simply
inserts another batch of input from a file. In the case of \def, one writes in
general
\def<ctrl-seq><string0>#1<string1> ... #k<stringk>{<right-hand side>}
where <string0> ... <stringk> are sequences of zero or more characters not
including { or } or #; spaces are significant in these strings, except the
first character after the defined control sequence is ignored if it is a delimiting
space following a letter string. The <right-hand side> is any sequence of
characters with matching { and }'s, again with significant spaces. I am
describing the general case; in simple situations such as the definition of
\rtn in line 141, k is zero and <string0> is empty. The value of k must be ≤9.
When the defined <ctrl-seq> is recognized, later in the input, a matching
process ensues until the left-hand side is completely matched:
characters in string0...stringk must match exactly with corresponding characters
of the input text, with error messages if a failure to match occurs. Once the
matching process is complete, TEX will have found the ingredients to be
substituted for parameters #1 thru #k in the right-hand side, in the following
way: If <stringj> is empty, #j is the next single character of the input, or
(if this character is "{"), the next group of characters up to the matching "}".
If <stringj> is not empty, #j is 0 or more characters or {...} groups until the
next character of the input equals the first character of <stringj>. No macro
expansion is done during this matching process, and no backing up is done
if a failure occurs; the succeeding characters of the input string must
match the remaining characters of <stringj>. If the parameter #j turns out to be
a single {...} group, the exterior { and } are removed from the group. Note
that the matching process operates on characters of the internal input file,
not on tokens; it is possible, for example, that #j might turn out to be a
single delimiter character like "\".
Once #1 ... #k have been discovered by these rules, they are simply
copied into the positions occupied by #1 ... #k in the right-hand sequence.
And one further change is made to the right-hand side:
the first # is removed from any sequence of two or more #'s. This is
done so that definitions can be made in the right-hand side without causing
confusion with the current definition.
For example, consider what happens when the following definition is active:
\def\A#1 BC {\def\E ##1{##1#1 #1}}.
If the internal input file contains
\A {X-y} BC D
the resulting sequence after expanding the definition will be
\def\E #1{#1X-y X-y}D
(note the spacing).
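The matching and substitution rules can be traced with a short Python sketch.
This is an illustration only, under stated assumptions: the text handed to
expand starts just after the control sequence and its delimiting space, the
input is well formed (running off the end raises an IndexError instead of a
TEX error message), and all function names are hypothetical.

```python
import re

def skip_item(text, i):
    """Return the index just past one character, or past a whole {...} group."""
    if text[i] != "{":
        return i + 1
    depth = 0
    while True:
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
        i += 1
        if depth == 0:
            return i

def match_parameters(strings, text):
    """strings is [string0, ..., stringk]. Returns (parameters, unread text)."""
    i, params = 0, []
    for j, s in enumerate(strings):
        if j > 0:
            start = i
            if s == "":
                i = skip_item(text, i)      # one character or one group
            else:
                while text[i] != s[0]:      # up to the first character of stringj
                    i = skip_item(text, i)
            p = text[start:i]
            if p.startswith("{") and skip_item(p, 0) == len(p):
                p = p[1:-1]                 # a single group loses its outer braces
            params.append(p)
        if not text.startswith(s, i):       # no backing up after a failure
            raise ValueError("input does not match the definition")
        i += len(s)
    return params, text[i:]

def substitute(rhs, params):
    """Copy parameters into the right-hand side and shorten runs of #'s."""
    def repl(m):
        hashes, digit = m.group(1), m.group(2)
        if len(hashes) >= 2:
            return hashes[1:] + digit       # drop the first # of the run
        return params[int(digit) - 1] if digit else hashes
    return re.sub(r"(#+)(\d?)", repl, rhs)

def expand(strings, rhs, text):
    params, rest = match_parameters(strings, text)
    return substitute(rhs, params) + rest
```

For the definition of \A above, string0 is empty and string1 is " BC ", so the
call expand(["", " BC "], r"\def\E ##1{##1#1 #1}", "{X-y} BC D") reproduces
the expansion shown in the example.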
The above is not the most general sort of macro definition facility, but I
have applied Occam's razor. The reader should now be able to look back at the
definitions given above for \quoteformat, \algstep, and \eqalign, discovering
that they are really quite simple after all.
Assignment actions: I mentioned that the ``pure input'' contains codes for
primitive actions that can be carried out, as well as the characters
being transmitted to the final document. Some of these actions are simple
assignment actions which set parameters informing TEX how to transform the
subsequent input. Like macro definitions, assignment actions have an effect
only until leaving the current {...} or $...$ or $$...$$ group, or until a
reassignment occurs.
Here is a list of TEX's assignment actions:
\chcode'<octal>←<number>   defines basic delimiter
\deffnt <char><filename>   the real font name corresponding to its nickname
\:<char>                   the current font to be used
\mathrm                    the font to be used for math functions like log
\mathit                    the font to be used for math variables like x
\mathsy                    the font to be used for math symbols like \mu
\ragged or \justified      appearance of right margins
\hsize <length>            normal width of generated lines of text
\vsize <length>            normal height of generated pages of text
\parindent <glue>          indentation on first line of paragraphs
\hangindent <glue>         indentation on all but first line of paragraph
                           (hangindent reset to zero after every paragraph)
\lineskip <glue>           vertical spacing between generated baselines
\parskip <glue>            additional spacing between paragraphs (added to lineskip)
\dispskip <glue>           additional spacing above and below displayed formulas
\topskip <glue>            additional spacing below an inserted box at top
\botskip <glue>            additional spacing above an inserted box at bottom
\htabskip <glue>           horizontal spacing between \haligned columns
\vtabskip <glue>           vertical spacing between \valigned rows
\output <routine>          what to do with filled pages
All of these quantities will have default values, which I will choose after
TEX is up and running; the default values will apply whenever a quantity has
not been respecified. In the above, <length> is of the form
<number><unit> or -<number><unit>
where the <number> is a digit string or a digit string containing a period (decimal
point), and where <unit> is either pt, pc, in, cm, or mm. Furthermore <glue>
is of the form
<length> plus <length> minus <length>
where the plus and minus parts cannot be negative; any of the three lengths
can be omitted if zero, as long as at least one of the three is present.
A blank space after <number> or <unit> is removed from the input.
For example, standard XGP conventions at the moment are to produce 8 1/2 by 11
inch pages with one-inch margins and interline spacing of 4 pixels; in a font 30
pixels high, this would be specified by
\hsize 6.5 in \vsize 9 in \lineskip .17 in
and you could also say
\def\hmargin{1 in} \def\vmargin{1 in}
for the benefit of TEX's default \output routine. (I will explain \output later.)
In a future extension I will include the additional assignment action
\tempmeas <length> next <number> lines
so that TEX can set narrow measure for small illustrations (cf. vol. 1 page 52).
In order to keep from confusing TEX's page-builder and paragraph builder,
changes made to \hsize and \vsize will take effect only when TEX puts the
first line onto a fresh page or the first item into a fresh paragraph.
The assignment actions \chcode and \deffnt do not follow scope rules; they have
``global'' effect.
Control structure: It is now high time to consider TEX's paragraph-building
and page-building mechanisms, and the other aspects of its control structure.
In fact I probably should have started with this explanation long ago; it might
have saved both you and me a lot of confusion.
Internally TEX deals with boxes and ``hlists'' (which are horizontal lists
of boxes separated by horizontal glue) and ``vlists'' (which are vertical lists
of boxes separated by vertical glue). The two kinds of lists are not allowed
to mingle, and TEX must know at any time whether it is building an hlist or
a vlist.
We say that TEX is in ``horizontal mode'' when it is working on an hlist --
intuitively, when it is in the middle of a line -- otherwise it is in
``vertical mode.'' More formally, let us write "h..." for legal TEX input
beginning in horizontal mode, and "v..." for legal TEX input beginning in
vertical mode. Assignment actions don't affect the mode, so they are ignored
in the present discussion.
When in vertical mode, the next token of the (pure) input, not counting
assignment actions, should be one of the following:
v... =
\vskip <glue> v...
(vertical glue, appended to current vlist)
\ljustline{h...}v...,\ctrline{h...}v..., or \rjustline{h...}v...
(append box of width \hsize to current vlist)
\hrule [height <length>] [width <length>]
(append a horizontal rule to current vlist, this is like a
solid black box, default height is .5 pt and default width
is the eventual width of this vlist, namely from the left of
its leftmost box to the right of its rightmost box)
\moveleft <length> v... or \moveright <length> v...
(the next box in the current vlist is to be shifted
wrt the normal left edge, this applies to one box only)
\topinsert{v...}v... or \botinsert{v...}v...
(insert vlist into current page, or onto the next page if
it doesn't fit on the current one)
\halign{...}v...
(returns a vlist of m boxes formed from mn hlists as described
above; the vlist is appended to the current one)
\top{v...}v..., \mid{v...}v..., or \bot{v...}v...
(used mostly in \valign, returns a vlist that \vjust will
top justify, center vertically, or bottom justify)
* \mark{...}v...
(associates titles, etc., with the following lines of text)
* \penalty <number>v...
(additional units of badness if a page break comes here)
<blank space>v... or \ v... or \par v...
(ignored)
* \noindent h...
(initiates a nonindented paragraph)
* <box> h...
(initiates an indented paragraph beginning with this box)
* \eject v...
(ejects the current page unless it is empty)
** \eqno(<number>)v...
(attaches equation number at right of displayed equation)
Here * designates options which are legal only if the current vlist is
being maintained by the page builder, and ** designates an option legal only
in displayed formula mode. The page builder is active at the beginning of
the program but not within other routines (e.g. \topinsert) that construct
vlists.
A box specification is one of the following:
<box> =
<nonblank character>
(the box consisting of that character, in the current font)
\ascii'<octal>
(equivalent to the character, which may be hard to enter otherwise)
\hjust to <length>{h...}
(the hlist converted to box of specified length, if necessary
by breaking it into several lines as in the paragraph routine)
\hjust{h...} or \hexpand <length>{h...}
(the hlist converted to box of its natural line length plus the
specified length... here \hjust is like \hexpand 0 pt)
\vjust to <length>{v...}
(the vlist converted to a box of the specified length,
in this case without the ability to break it apart, sorry)
\vjust{v...} or \vexpand <length>{v...}
(the vlist converted to box of its natural height plus the
specified amount... here \vjust is like \vexpand 0 pt)
\page
(the page just completed, should be used only in \output routine)
\box0, \box1, ... \box9
(the ``global'' box most recently stored by \save0,...,\save9)
In the cases of \page and \boxk, the box is destructively read, not copied;
the next attempt to read it will be an error. Box construction routines and
\output routines may use the designation \savek (0≤k≤9) to store a box into
one of the ten global save areas; again, this box is not copied, it is
manufactured from the current hlist or vlist and the current hlist or vlist
is emptied.
When in horizontal mode, the next token of the (pure) input, not counting
assignment actions, should be one of the following:
h... =
\hskip <glue> h...
(horizontal glue, appended to hlist)
\tjustcol{v...}h...,\midcol{v...}h..., or \bjustcol{v...}h...
(append box of height \vsize to current hlist)
\vrule [height <length>] [width <length>]
(append vertical rule to current hlist, this is analogous to
\hrule but the defaults are .5 pt width and hlist height;
if the height is specified, the rule goes up by this much
starting at the baseline)
\raise <length> h... or \lower <length> h...
(the next box or rule in the current hlist is to be shifted
wrt the normal base line, this applies to one box only)
* \topinsert{v...}h... or \botinsert{v...}h...
(insert vlist into precisely the page that contains the previous
box in the current hlist, e.g. a footnote)
\valign{...}h...
(returns an hlist of m boxes formed from mn vlists as described
above; the hlist is appended to the current one)
\left{h...}h..., \ctr{h...}h..., or \right{h...}h...
(used mostly in \halign, returns an hlist that \hjust will
left justify, center horizontally, or right justify)
\penalty <number>h...
(additional units of badness if a line break comes here)
<blank space>h...
(variable spacing-between-words glue in the current font, but
ignored in math mode)
\ h...
(same as blank space but not ignored in math mode)
<box> h...
(in particular, a nonblank character... the box is appended
to the current hlist)
$...$ h...
(hlist determined in math mode is appended to current hlist)
* $$...$$ h...
(interrupts the current paragraph, which is sent to the page
builder, but \hangindent is not cleared... the paragraph resumes
after the closing $$... within the $$'s is a single math
formula or a bunch of them specified by \eqalign or \halign,
they will be centered and appended to the vlist of the page
builder according to the conventions for displayed equations...
appropriate vertical glue is also passed to the page builder...
the additional glue above a displayed equation (\dispskip)
is not added if the text on the preceding line of the paragraph
does not overhang the first displayed equation after centering)
* \par v... or <two consecutive carriage-returns in external input> v...
(end of paragraph, the current hlist is broken into lines
as explained later; the lines are appended to the vlist of
the page-builder)
* \eject h...
(terminates the current ``paragraph'' and the current page, but
the final line of the current ``paragraph'' is justified as
if in mid-paragraph; the text resumes with a new ``paragraph''
which is not indented, nor is \hangindent cleared)
Here * designates options which are legal only if the current hlist is
being maintained by the paragraph builder, which is called into action by the
page builder as explained above.
Boxes have a reference point on their left edge, and this reference point is
used when gluing two boxes together. If the box is a simple character from a
font, the reference point is at the left of the character at the baseline
(i.e., at the bottom of letters like x but not like y; the box extends below
the baseline to accommodate the descending parts of letters). When boxes are
concatenated horizontally, their baselines are lined up (unless otherwise
specified by \raise or \lower). The maximum height above the baseline and the
maximum depth below the baseline are also remembered, in order to determine
the height of the resulting box. When boxes are concatenated vertically, their
left edges are lined up (unless otherwise specified by \moveright or \moveleft).
The distance between consecutive baselines is taken to be \lineskip plus any
additional vertical glue specified by \vskip or \parskip, etc., unless this is too
small to prevent overlap of boxes; in the latter case the boxes are butted
together with zero glue. The baseline of the result is taken to be the baseline
of the bottom line. The maximum distance to the right of the reference edge
is taken to be the width of the resulting box.
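The bookkeeping for concatenated boxes can be sketched in a few lines of
Python. This is a simplified model and not TEX's data structure: glue between
boxes in an hlist is ignored, \moveleft and \moveright are ignored, and the
names Box, hconcat, and vconcat are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Box:
    width: float
    height: float   # extent above the baseline
    depth: float    # extent below the baseline

def hconcat(boxes, shifts=None):
    """Line up baselines; shifts[i] > 0 raises box i, shifts[i] < 0 lowers it."""
    shifts = shifts or [0.0] * len(boxes)
    return Box(
        width=sum(b.width for b in boxes),
        height=max(b.height + s for b, s in zip(boxes, shifts)),
        depth=max(b.depth - s for b, s in zip(boxes, shifts)),
    )

def vconcat(boxes, lineskip, extra=0.0):
    """Line up left edges; the baseline of the result is that of the bottom box."""
    total = boxes[0].height                     # top edge down to first baseline
    for above, below in zip(boxes, boxes[1:]):
        wanted = lineskip + extra               # \lineskip plus any \vskip etc.
        butted = above.depth + below.height     # baseline distance with zero glue
        total += max(wanted, butted)            # never let the boxes overlap
    return Box(
        width=max(b.width for b in boxes),
        height=total,
        depth=boxes[-1].depth,
    )
```

The max in vconcat expresses the rule that boxes are butted together when the
requested baseline distance would make them overlap.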
The quantity \lineskip is ignored before and after \hrule's. Thus, one may
write for example
\vskip 3pt \hrule \vskip 2pt \hrule \vskip 3pt
to get a double horizontal rule with 2 points of space in between and with
3 points of space separating the double rule from the adjacent lines, regardless of
the current value of \lineskip.
When two consecutive elements of an hlist are simply characters from the same
font, TEX looks at a table associated with that font to see whether or not
special symbols should be specified for this pair of characters. For example,
some of my standard fonts will make the following substitutions:
ff → <ff>
fi → <fi>
fl → <fl>
<ff>i → <ffi>
<ff>l → <ffl>
`` → <``>
'' → <''>
-- → <en-dash>
<en-dash>- → <em-dash>
I will use the codes '11, '12, '13, '14, '15, '175, '177 for the first seven
combinations, since TEX will not confuse them with basic delimiters at
this stage. (Other suggestions for combinations are := → <:=> and, for
fancy coffee-table books that are to be set in an expensive-looking oldstyle type,
ligatures for ct and st.)
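A table-driven pass of this kind is easy to sketch. The Python fragment below
is an illustration only: it represents each combined symbol by its bracketed
name instead of a character code, and the names LIGATURES and apply_ligatures
are hypothetical.

```python
# Pairs of adjacent symbols and the symbol that replaces them.
LIGATURES = {
    ("f", "f"): "<ff>", ("f", "i"): "<fi>", ("f", "l"): "<fl>",
    ("<ff>", "i"): "<ffi>", ("<ff>", "l"): "<ffl>",
    ("`", "`"): "<``>", ("'", "'"): "<''>",
    ("-", "-"): "<en-dash>", ("<en-dash>", "-"): "<em-dash>",
}

def apply_ligatures(chars):
    """Scan left to right, merging the last output symbol with the next one
    whenever the pair appears in the table."""
    out = []
    for c in chars:
        if out and (out[-1], c) in LIGATURES:
            out[-1] = LIGATURES[(out[-1], c)]
        else:
            out.append(c)
    return out
```

Because a merged symbol can merge again, "ffi" becomes <ffi> by way of <ff>,
and three hyphens become an em-dash by way of an en-dash.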
Note the en-dash and em-dash here; there are actually four different characters
involved in mathematical publishing,
the hyphen (for hyphenating words),
the en-dash (for contexts like "13--20"),
the minus sign (for subtraction),
and the em-dash (for punctuation dashes).
These are specified in TEX as -, --, - within $'s, and ---, respectively.
The above rules for v... and h... summarize most TEX commands, except for the
assignment actions already summarized and for the operations of interest in
page output or math mode.
Here now is the code for \ACPpages which shows complex page layout. The code
uses ``variables'' \tpage and \rhead which are not part of TEX, I am making
use of TEX's macro capability to ``assign'' values to these symbols. Readers
who are not familiar with such a trick may find it amusing, and I guess it
won't be terribly inefficient since pages come along comparatively rarely.
\def \titlepage {\def \tpage{T}} % causes \tpage to be set to T for ``true''
\def \runninglefthead#1 {\def \rhead{{\:m#1}}}
\def \runningrighthead#1 section#2 {\mark
{\ifeven{\hjust to .375 in {\left{\cpage}}\left{\rhead}#2}
\else{#2\right{\:m#1}\hjust to .375 in {\right{\cpage}}}}}
\def \ACPpages starting at page #1:
{\setcpage #1 % sets current page number for next page
\output{\lineskip 12 pt % beginning of output routine, resets \lineskip
\vskip \vmargin % skips top margin (\vmargin is defined by user)
\ifT \tpage % the next is used when \tpage is T
{\def \tpage{F} % resets \tpage
\topline % user's special line for top of title pages
\moveright \hmargin % adjust for left margin
\ljustline{\page}
\vskip 3 pt
\moveright \hmargin
\ljustline{\hjust to 29 pc{\:c \ctr{\cpage}}}} % center page no. at bottom
\else {\moveright \hmargin % this format used when \tpage ≠ T
\ljustline{\hjust to 29 pc{\:a \ifeven{\topmark}\else{\botmark}}}
\vskip 12 pt
\moveright \hmargin \ljustline{\page}}
\advcpage}} % increase current page number by 1
The \output code is activated whenever the page builder has completed a page.
TEX is then in vertical mode, and the settings of \hsize, \lineskip, etc. are
unpredictable so such things should be reset if they are used. The box defined
by the vlist constructed by the \output routine is output, unless it is
empty (e.g. if it were \save'd).
The TEX actions used in the above code and not explained already are:
\setcpage <number>         Sets the current page to a given integer;
                           if negative, denotes roman numerals.
\advcpage                  Increases the absolute value of current page
                           number by one.
\cpage                     A character string showing the value of the
                           current page is inserted into the input, as
                           a decimal number with leading zeroes suppressed
                           or as a roman numeral (lower case).
\ifeven{α}\else{β}         Uses α if current page is even, otherwise uses β
                           (TEX's scanner skips over the other one,
                           one character at a time).
\ifT <char> {α}\else{β}    Uses α if <char> is T, otherwise uses β.
\topmark, \botmark         The \mark operation associates an uninterpreted
                           string of characters with the set of subsequent
                           lines received by the page builder, until the
                           next mark; "\topmark" inserts into TEX's input
                           the mark associated with the first line on the
                           current \page, and "\botmark" the mark
                           associated with the last line, not counting
                           any \topinserted or \botinserted lines.
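The page-number conventions can be made concrete with a small Python sketch.
It models the current page as a signed integer, as described for \setcpage;
the function names are hypothetical and the roman-numeral routine is an
ordinary textbook one, not taken from TEX.

```python
def roman(n):
    """Lower-case roman numeral for a positive integer n."""
    pairs = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"), (100, "c"),
             (90, "xc"), (50, "l"), (40, "xl"), (10, "x"), (9, "ix"),
             (5, "v"), (4, "iv"), (1, "i")]
    out = ""
    for value, letters in pairs:
        while n >= value:
            out += letters
            n -= value
    return out

def cpage(page):
    """Character string for the current page: decimal, or roman if negative."""
    return roman(-page) if page < 0 else str(page)

def advcpage(page):
    """Increase the absolute value of the page number by one."""
    return page - 1 if page < 0 else page + 1
```

Under this model a preface numbered with \setcpage -1 runs i, ii, iii, ...,
while the stored values run -1, -2, -3, ....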
I propose to use the following as the default output routine for TEX. It uses
five more actions, namely \day, \month, \year, \time, and \file, corresponding
to the environment which called TEX.
\def \hfill{\hskip plus 100 cm} %``infinite'' stretchability
\lineskip 0 pt %reset space between lines
\:\font %resets to default font character
\vskip \vmargin %skips over top margin
\ifT \notitle {} \else{ %optionally skips title
\ljustline{\hjust to 7.5 in{ %title line has pageno 1 inch from right
\hskip\hmargin %skip left margin
\day\ \month\ \year\hfill %date
\time\hfill %starting time
\file\hfill %principal input file name
\cpage}} %page number
\vskip 12 pt} %one pica skip after title line
\moveright \hmargin \ljustline{\page} %insert body of page, skipping left margin
\advcpage %increase page number
\hmargin, \vmargin, \font, \notitle are settable by the user (or SNAIL), and
they in turn have default values.
Now let's consider the page-building routine more closely; this gives us a chance
to study the process TEX uses for vertical justification, which introduces some of
the concepts we will need in the more complicated routine used for horizontal
justification.
The first idea is the concept of ``badness.'' This is a number computed on the
basis of the amount of stretching or shrinking necessary when setting the glue.
Suppose we have a list of n boxes (either a horizontal list or a vertical list),
separated by n-1 specifications of glue. Let w be the desired total length
of the list (i.e., the desired width or height, depending on which dimension we
are justifying); let x be the actual total length of boxes and glue; and let
y,z be the total amount of glue parameters for expansion and contraction. The
badness of this situation is defined to be
infinite, if x - z > w + ε, where ε is a small tolerance to compensate
for floating-point rounding;
100((x-w)/z)↑3, if w + z + ε ≥ x > w;
0, if x = w;
100((w-x)/3y)↑3, if w > x;
plus penalties charged for breaking lines in comparatively undesirable places.
According to these formulas, stretching by y has a badness rating of 100/27,
or about 3.7; stretching by 2y is rated about 30; stretching by 3y is rated
100 units of badness, and so is shrinking by the maximum amount z. I plan to
charge a penalty of something like 80 units for breaking a paragraph or sequence of
displayed formulas in such a way that only one line comes by itself on a page;
thus, for instance, a five-line paragraph gets a penalty of 80 if we break
after the first or fourth line, but no penalty if we break after two or three
lines. I will of course be tuning these formulas to make them correspond as
well as I can to my aesthetic perceptions of bad layout. The user will be
able to specify his own additional \penalty points for undesirable breaking
between specific lines (e.g. in a MIX program to break before an instruction
that refers to *-1).
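The glue part of the badness definition translates directly into code. The
Python sketch below leaves out the penalty points; the two guards against
division by zero and the value of the tolerance are assumptions of the sketch,
not part of the definition above.

```python
import math

def badness(w, x, y, z, eps=1e-9):
    """w: desired length; x: natural length of boxes and glue;
    y, z: total stretchability and shrinkability of the glue."""
    if x - z > w + eps:
        return math.inf                     # cannot shrink enough
    if x > w:
        # Guard (assumption of this sketch): with no shrinkable glue, an
        # overshoot that is within the tolerance is treated as free.
        return 100 * ((x - w) / z) ** 3 if z > 0 else 0.0
    if x == w:
        return 0.0
    # Guard (assumption of this sketch): stretching with no stretchable
    # glue is treated as infinitely bad.
    return 100 * ((w - x) / (3 * y)) ** 3 if y > 0 else math.inf
```

With y = z = 1, stretching by 3 and shrinking by 1 both give 100 units, and
stretching by 1 gives 100/27, matching the figures quoted above.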
Breaks are not allowed before or after line rules.
The page-building routine forms a vlist as explained above, accumulating lines of
text and vertical glue until the natural height of its previous accumulation, plus
the k new lines, is greater than or equal to the specified page height, \vsize.
Then it breaks the new paragraph just after the jth line, for some 0≤j≤k, whichever
value of j has the minimum badness; if this minimum occurs for more than one
j, the largest such j is used. Then the glue between lines j and j+1 is discarded,
and the remaining k-j lines are carried over to the next page. (They are immediately
checked to ensure that they don't already overfill the new page, and they are
broken in the same way if necessary.) The \output routine is invoked whenever a
full page has been generated.
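The choice of j can be sketched as follows. This Python fragment assumes that
the badness of each candidate break has already been computed; it shows only
the lone-line penalty mentioned earlier and the tie-breaking rule. The names
and the default of 80 units are illustrative.

```python
def lone_line_penalty(j, k, penalty=80):
    """Penalty for breaking a k-line paragraph after line j when that leaves
    exactly one of its lines by itself on a page."""
    return penalty if k > 1 and j in (1, k - 1) else 0

def choose_break(costs):
    """costs[j] is the total badness of breaking after line j, for 0 <= j <= k.
    Returns the j of minimum badness, taking the largest j in case of a tie."""
    best = min(costs)
    return max(j for j, c in enumerate(costs) if c == best)
```

For a five-line paragraph the penalties for j = 0, ..., 5 are 0, 80, 0, 0, 80,
0, so breaks after the first or fourth line are discouraged but not forbidden.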
A \topinsert or \botinsert interrupts this otherwise straightforward procedure.
The box to be inserted is computed, off to the side, and then an attempt is made
to place it in the current accumulated page. If it fits, well and good, we leave it
there. If not, it is carried over to the next page, in a natural but hard-to-
explain manner, unless the requirement about coming on the same page as a specific
line has to be met (i.e., box insertion in horizontal mode). Then the least bad
legitimate solution will be used.
Footnotes: I have used footnotes only three times in over 2000 pages of The Art of
Computer Programming, and personally I believe they should usually be avoided, so
I am not planning an elaborate footnoting mechanism (e.g. to break long footnotes
between pages or to mark the first footnote per page with an asterisk and the
second with a dagger, etc.). They can otherwise be satisfactorily handled by
\botinsert as defined here. A user will be able to get fancier footnotes if he
or she doesn't mind rewriting a few of TEX's subroutines.
The paragraph-building routine assembles an hlist as described above, and must
break it into lines of width \hsize for transmission to the page-builder.
(Note: There is only one page-builder, in spite of TEX's largely recursive
nature, and there is only one paragraph-builder. However, there can be
arbitrarily many \hjust to <length> routines active at once, and these are
analogous to the paragraph builder in most ways, since they have to break
their hlists into lines too. The discussion about line-breaking applies to
such routines too, but for convenience I will write this as if only the
paragraph-builder has to worry about breaking lines.)
The elements of the paragraph-builder's hlist are usually sequences of text
characters or fragments of math formulas, but they also may be indivisible
boxes constructed by TEX's higher level box operators. In my fonts there is
a small amount of variable glue between the individual text characters
(between the letters a and b, for instance, we would use the glue obtained as
a sum of right-glue for a and left-glue for b, as specified in font tables);
furthermore the spaces between words have more elastic glue as explained earlier.
TEX will give double y glue (but won't change the x glue) to the first space
that follows a period, exclamation point, question mark, or colon, unless
letters or digits or commas or semicolons or boxes intervene before this
space. A semicolon and a comma are treated similarly, but with 1.5 and 1.25 as
the relative amounts of y glue.
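One possible reading of the punctuation rule is sketched below in Python. It
returns, for each blank in a piece of text, the factor by which the y glue of
that blank would be multiplied. The treatment of boxes is omitted, and the
names are hypothetical; this is an interpretation of the rule, not a
specification.

```python
# Relative amounts of y glue for the first space after each mark.
FACTOR = {".": 2.0, "!": 2.0, "?": 2.0, ":": 2.0, ";": 1.5, ",": 1.25}

def space_stretch_factors(text):
    """Return one multiplier per blank space in text, in order."""
    factors, pending = [], 1.0
    for c in text:
        if c == " ":
            factors.append(pending)   # only the first space gets the extra glue
            pending = 1.0
        elif c in FACTOR:
            pending = FACTOR[c]       # a later mark overrides an earlier one
        elif c.isalnum():
            pending = 1.0             # intervening letters or digits cancel it
        # Other characters, such as a closing parenthesis, leave it unchanged.
    return factors
```

In "Hi, there. Now (yes!) ok" the four blanks get the factors 1.25, 2, 1, and
2; the closing parenthesis does not cancel the effect of the exclamation point.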
The main problem of the paragraph builder is to decide where to break a long
hlist. Again TEX uses the concept of ``badness'' discussed under the page
building routine, but this time it improves on what was done by providing a
``lookahead'' feature by which the situation in the later lines of a
paragraph can influence the breaks in the earlier lines; in practice this
often provides substantially better output.
Before discussing the lookahead feature, we need to define the location of
all permissible breaks. Every \hskip whose x or y glue exceeds the
spacing width of the current font is an acceptable place to break (and to
omit the horizontal glue) with no penalty. Adjacent \hskips are merged
together, incidentally, by adding the three glue components. Another acceptable
place to break without penalty is after an explicit hyphen or dash.
(Some \hskips, used for backspacing, have negative x; they are, of course,
unacceptable breaks.) The math formula routine which processes $...$ will allow
breaks just after binary operators and relations at the top level; relations
like =, ≤, ≡, etc. have only a small penalty, say 10; operators like +,-,x,/,
mod have a larger penalty, with - and mod larger than the others (say 30, 70, 30,
40, 80, respectively). Superscripts and subscripts are attached unbreakably to
their boxes.
There are four "discretionary" symbols used to provide or inhibit breaks.
First is the \penalty <number> command, which specifies that a break is
admissible if the stated penalty is paid. Then there are three more:
\- OK to hyphenate this word here (penalty 30);
\+ do not break here;
\* OK to break here (penalty 30), but insert
a times sign, not a hyphen.
The last of these would be used in a long product like $(n+1)\*(n+2)\*(n+3)\*(n+4)$.
In a minute I will discuss TEX's way of doing automatic hyphenation, but for
the moment let's suppose we know all the candidate places to break lines; now
what is the best way to break up an entire paragraph? I think it is best to define
``best'' as the way that minimizes the sum of the squares of the badnesses of all
the individual breaks. This will tend to minimize the maximum badness as well
as to handle second-order and third-order badnesses, etc. As before, badness
is based on the amount of stretching or shrinking, plus penalty points.
To find the best breaks by this criterion, we don't have an exponentially hard
problem; a dynamic programming algorithm will find the absolutely best way to
break in time O(n↑2), where n is the number of permissible places to break.
Namely, let f(m) be the minimum sum of badness-squareds for the paragraph up
to break position m; then f(m) is the minimum over k<m of f(k) plus the square of
the badness of breaking the text (k,m].
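A minimal Python sketch of this recurrence follows. The badness function is supplied by the caller; the toy version below (slack in units of width, with the last line free) is only an assumption for the example and is not TEX's actual badness measure.

```python
import math

def best_breaks(n, badness):
    """Minimize the sum of squared badnesses over break positions 0..n.

    Position 0 is the start of the paragraph and n is its end.
    badness(k, m) is the badness of setting the text (k, m] as one line,
    or math.inf if that line is impossible.
    """
    f = [math.inf] * (n + 1)      # f[m] = best cost of the paragraph up to m
    prev = [None] * (n + 1)       # prev[m] = the k that achieves f[m]
    f[0] = 0.0
    for m in range(1, n + 1):
        for k in range(m):
            cost = f[k] + badness(k, m) ** 2
            if cost < f[m]:
                f[m], prev[m] = cost, k
    if f[n] == math.inf:
        return math.inf, []       # no feasible way to break the paragraph
    breaks, m = [], n
    while m > 0:                  # trace back through the recorded choices
        breaks.append(m)
        m = prev[m]
    breaks.reverse()
    return f[n], breaks

# Hypothetical toy data: four words of these widths, lines of width 6.
WIDTHS = [3, 2, 2, 5]
HSIZE = 6

def toy_badness(k, m):
    natural = sum(WIDTHS[k:m]) + (m - k - 1)   # one unit of space between words
    if natural > HSIZE:
        return math.inf
    if m == len(WIDTHS):
        return 0                               # the last line need not be filled
    return HSIZE - natural

print(best_breaks(len(WIDTHS), toy_badness))
```

On this toy input a first-fit strategy would put the first two words on one line and then leave the third word alone on a very loose line (cost 16); the recurrence instead breaks after the first word, for a total cost of 10.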
Actually a near-linear approximation to this quadratic algorithm will be
satisfactory: Given the best three places to break the k-th line, we use these
to find the best three places to break the (k+1)st line. When the end of
the paragraph is reached, or if the paragraph is so long that we don't have
enough buffer space (say more than 15 lines long), we clear out our buffers
by backtracing through the f(m) calculation to find the best-known breaking
sequence. In \ragged mode, the lines are not expanded or shrunk to \hsize, but
in \justified mode they are.
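One possible reading of this approximation is sketched below in Python: after each line, only the three cheapest break positions are kept as starting points for the next line. The names, the table-driven badness, and the omission of the 15-line buffer flush are all assumptions of the sketch. For simplicity the inner loop examines every later position; a real implementation would stop scanning once the line became too long.

```python
import math

def beam_breaks(n, badness, width=3):
    """Approximate line breaking, keeping only `width` candidates per line."""
    beam = {0: (0.0, [])}            # position -> (cost so far, breaks so far)
    best = (math.inf, [])            # best complete breaking found so far
    while beam:
        extended = {}
        for k, (cost, path) in beam.items():
            for m in range(k + 1, n + 1):
                b = badness(k, m)
                if b == math.inf:
                    continue         # this line is impossible
                c = cost + b ** 2
                if m == n:
                    if c < best[0]:
                        best = (c, path + [m])
                elif m not in extended or c < extended[m][0]:
                    extended[m] = (c, path + [m])
        # keep only the cheapest few places to break the line just finished
        keep = sorted(extended.items(), key=lambda item: item[1][0])[:width]
        beam = dict(keep)
    return best

# Hypothetical badness table for a four-position paragraph.
TABLE = {(0, 1): 3, (0, 2): 0, (1, 2): 4, (1, 3): 1, (2, 3): 4, (3, 4): 0}

def table_badness(k, m):
    return TABLE.get((k, m), math.inf)

print(beam_breaks(4, table_badness))
```

Because candidates are discarded after every line, this search is not guaranteed to find the true minimum in general; on the small table above it happens to agree with the full quadratic computation.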
Built-in hyphenation:
Besides using the permissible breaks, TEX will try to hyphenate words.
It will do this only in a sequence of lower-case letters in the same font that
are preceded by a space and followed by space, period, or comma. Note that, for
example, capitalized words (which are often foreign names) or already-hyphenated
compound words will not be broken. If a permissible hyphenation break is
discovered, a penalty of 25 units of badness will be paid. An attempt is also
made to avoid hyphenation at the end of the second-last line of a paragraph.
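The test for which words are candidates can be sketched with a regular expression in Python. This is an illustration under stated assumptions: the input is a plain string, font changes are not modeled, and a word at the very start of the string is not matched because it is not preceded by a space.

```python
import re

# A run of lower-case letters preceded by a space and followed by a space,
# period, or comma. The name CANDIDATE is hypothetical.
CANDIDATE = re.compile(r"(?<= )[a-z]+(?=[ .,])")

def hyphenation_candidates(line):
    """Return the words in line that the hyphenator would be allowed to try."""
    return CANDIDATE.findall(line)

print(hyphenation_candidates("the well-known Knuth wrote programs, slowly."))
```

In the sample sentence the capitalized name and the already-hyphenated compound are skipped, as the text requires.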
There is no point in finding all possible places to hyphenate. For one thing,
the problem is extremely difficult, since e.g. the word "record" is supposed to
be broken as "rec-ord" when it is a noun but "re-cord" when it is a verb.
Consider the word "hyphenation" itself, which is rather an exception:
hy-phen-a-tion vs. con-cat-e-na-tion
Why does the n go with the a in one case and not the other? Starting at letter
a in the dictionary and trying to find rigorous rules for hyphenation without
much knowledge, we come up against a-part vs. ap-er-ture, aph-o-rism vs. a-pha-sia,
etc. It becomes clear that what we want is not an accurate but ponderously slow
routine that consumes a lot of memory space and processing time; instead we want
a set of hyphenation rules that are
a) simple enough to explain in a couple of pages;
b) almost always safe;
c) powerful enough to find a close enough approximation to, say,
80% of the words already hyphenated in The Art of Computer Programming.
To justify point (c), I find that there are about two hyphenated words per page
in the books, and in the places where the rules I shall propose do not find the
identical hyphenation they would only very rarely cause a really bad break. The
time needed to handle the remaining exceptions is therefore insignificant by
comparison with what I already do when proof-reading.
So here are the rules TEX uses (found with the help of Frank Liang):
1. If the first seven letters of the word appear in a small internal dictionary
of words to be treated specially (about 350 words in all, see below), use the
hyphenation found in that dictionary. Furthermore some of the entries in the
dictionary specify looking at more than seven letters to make sure that
the exception is real; e.g., "in-form-ant" wouldn't be distinguished from the
unexceptional "in-for-ma-tion" on the basis of seven letters alone. If the given
word has seven letters or less and ends with "s", the word minus the s is also
looked up. The dictionary contains nearly all the common English words for
which the following rules would make an incorrect break, plus additional words
that are common in computer science writing and whose breaks are not satisfactorily
found by the following rules.
2. Suffix removal. A permissible hyphen is inserted if the word ends with
-able(preceded by e,h,i,k,l,o,u,v,w,x,y or "nt" or "rt"), -ary(preceded by "ion" or
"en"), -cal, -cate(preceded by a vowel), -cial, -cious(unless preceded by "s"),
-cient, -dent, -ful, -ize(preceded by "l"), -late(preceded by a vowel), -less, -ly,
-ment, -ness, -nary (unless preceded by "e" or "io"), -ogy, -rapher and -raphy,
-scious, -scope, -scopic, -sion, -sphere, -tal, -tial, -tion, -tion-al, -tive,
-ture. [Here a ``vowel'' is a,e,i,o,u,y; the other 20 letters are ``consonants.'']
There is also a somewhat more complex rule for words ending with "ing":
If "ing" is preceded by fewer than four letters, insert no permissible hyphens.
Otherwise if "ing" is preceded by two identical consonants other than f, l, s, or
z, break between them. Otherwise if it is preceded by a letter other than "l",
break the "-ing". Otherwise if the letter before "ling" is b,c,d,f,g,k,p,t, or z,
break before this letter (except break ck-ling if the word ends with "ckling").
Otherwise break -ing.
Furthermore the same suffix removal routine is applied to the residual word after
having successfully found the suffixes -able, -ary, -ful, -ize, -less, -ly, -ment,
and -ness. If the original word ends in s and no suffix was found, the
final s is removed and the suffix routine is applied again. If
the original word ends in "ed" the suffix routine is applied to the word with the
final d removed, and (if that is unsuccessful) to the word with final "ed" removed.
Any suffixes found are effectively removed from the word, not examined by
rules 3 and 4. If the original word ends with "e" or "s" or "ed", this final
letter or pair of letters is also effectively removed.
3. Prefix removal. A permissible hyphen is inserted if the word begins with
be-(followed by c,h,s, or w), com-, con-, dis-(unless followed by h or y),
equi-(unless followed by v), equiv-, ex-, hand-, horse-, hy*per-, im-,
in- (but use in*ter- or in*tro- if present), lex*i-, mac*ro-, math*e-,
max*i-, min*i-, mul*ti-, non-, out-, over-, pseu*do-, quad-, semi-, some-,
sub-, su*per-, there-, trans-(followed by a,f,g,l, or m),
tri-(followed by a, f, or u), un*der-, un-(unless followed by "der" or "i").
Here an asterisk denotes a second permissible hyphen to be recognized, but
only if the entire prefix appears.
After the prefixes dis-, im-, in-, non-, over-, un- have been recognized the prefix
routine is entered again. Any prefixes found are effectively removed from
the word, and not examined by rule 4.
4. Study of consonant pairs. In the remainder of the word, after suffixes and
prefixes have been removed, we combine the letter pairs ch, gh, ph, sh, th,
treating them as single consonants.
If the three-letter combination XYY is found, where X is a vowel and Y a
consonant, break between the Y's, except if Y is l or s. In the latter case,
break only if the following letter is a vowel and the word doesn't end "XYYer"
or "XYYers".
If the three-letter combination Xck is found, where X is a vowel, break
after the "ck".
If the three-letter combination Xqu is found, where X is a vowel, break
before the "qu".
If the four-letter combination XYZW is found, where X and W are vowels and
Y and Z are consonants, break between the consonants unless YZ is one of
the following pairs:
bl, br, cl, cr, chl, chr, dg, dr, fl, fr, ght, gl, gr, kn, lk, lq,
nch, nk, nx, phr, pl, pr, rk, sp, sq, tch, tr, thr, wh, wl, wn, wr.
Furthermore do not break between the consonants if the word ends with
XYZer, XYZers, XYZage, XYZages, when YZ is one of the pairs
ft, ld, mp, nd, ng, ns, nt, rg, rm, rn, rt, st.
5. After applying rules 1 thru 4, take back all ``permissible'' breaks that
result in only one or two letters after the break, or that have only one
letter before it, or that have only one letter between prefix and suffix.
(Thus, for example, the suffix rule will break -ly, but this won't
count in the final analysis; it does affect the hyphenation algorithm, however,
since the suffixes in words like "rationally" will be found by repeated
suffix removal.)
Also, take back any break leading to the syllable -e, -xe, or -xye, where
x and y are any two letters and where this e occurs at the end of the shortest
subword on which suffix removal was tried in rule 2. (This rule avoids syllables
with "silent e". For example, we do not wish to hyphenate rid-dle, proces-ses,
was-teful, arran-gement, themsel-ves, lar-gely, and so on.)
Example of hyphenation: su-per-califragilis-ticex-pialido-cious.
(This is a correct subset of the "official" syllabification specified
by the coiners of this word, namely su-per-cal-i-frag-il-is-tic-ex-pi-al-i-do-etc.)
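To make the "ing" case of rule 2 concrete, here is a hedged Python sketch of just that one case. The function name is hypothetical; the dictionary of rule 1, the other suffixes, and the take-back step of rule 5 are not modeled, so exceptions such as dar-ling are not handled here.

```python
VOWELS = set("aeiouy")

def ing_break(word):
    """Place the one permissible hyphen given by the "ing" rule, or return None."""
    if not word.endswith("ing"):
        return None
    stem = word[:-3]
    if len(stem) < 4:
        return None                    # fewer than four letters before "ing"
    a, b = stem[-2], stem[-1]
    if a == b and b not in VOWELS and b not in "flsz":
        cut = len(stem) - 1            # between doubled consonants: run-ning
    elif b != "l":
        cut = len(stem)                # ordinary case: break-ing
    elif word.endswith("ckling"):
        cut = len(stem) - 1            # special case: tick-ling
    elif a in "bcdfgkptz":
        cut = len(stem) - 2            # before the letter preceding "ling": trou-bling
    else:
        cut = len(stem)                # remaining "ling" words: fall-ing
    return word[:cut] + "-" + word[cut:]

for w in ["running", "breaking", "troubling", "tickling", "falling", "bring"]:
    print(w, ing_break(w))
```

The order of the tests mirrors the chain of "otherwise" clauses in the rule, except that the "ckling" exception is checked before the general "ling" case so that it takes priority.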
Now here's the DICTIONARY of words which should be handled separately.
(When an asterisk appears, it means this letter is checked too in addition
to the first seven letters.)
First, we include the following words since they are exceptions to the
suffix rules:
(-able) con-trol-lable eq-uable in-sa-tiable ne-go-tiable so-ciable turn-table
un-con-trollable un-so-ciable
(-dent) de-pend-ent in-de-pend-ent
(-ing) any-thing bal-ding dar-ling dump-ling err-ing eve-ning every-thing
far-thing found-ling ink-ling main-spring nest-ling off-spring
play-thing sap-ling shoe-string sib-ling some-thing star-ling ster-ling
un-err-ing up-swing weak-ling year-ling
(-ize) civ-i-lize crys-tal-lize im-mo-bi-lize me-ta-bo-lize mo-bi-lize
mo-nop-o-lize sta-bi-li*ze tan-ta-lize un-civ-i-lized
(-late) pal-ate
(-ment) in-clem-ent
(-ness) bar-on-ess li-on-ess
(-ogy) eu-logy ped-a-gogy
(-scious) lus-cious
(-sphere) at-mos-phere
(-tal) met-al non-metal pet-al post-al rent-al
(-tion) cat-ion cat-ions
(-tive) com-bat-ive
(-ture) stat-ure
Exceptions to the prefix rules:
(be-) beck-on bes-tial
(com-) com-a-tose come-back co-me-dian comp-troller
(con-) cone-flower co-nun-drum
(equi-) equipped
(hand-) handle-bar
(in-) inch-worm ink-blot inn-keeper
(inter-) in-te-rior
(mini-) min-is-ter min-is-try
(non-) none-the-less
(quad-) qua-drille
(some-) som-er-sault
(super-) su-pe-rior
(un-) u-na-nim-ity u-nan-i-mous unc-tuous
Exceptions to the consonant rules:
bt: debt-or
ck: ac-know-ledge ac-know-ledg-ment
ct: de-duct-i*ble ex-act-i-tude in-ex-act-i-tude pre-dict-*able re-spect-*able
un-pre-dict-able vict-ual
dl: needle-work idler
ff: buff-er off-beat off-hand off-print off-shoot off-shore stiff-en
ft: left-ist left-over lift-off
fth: soft-hearted
gg: egg-nog egg-head
gn: cognac for-eign-er vi-gnette
gsh: hogs-head
ld: child-ish eld-est hold-out hold-over hold-up
lf: self-ish
ll: bull-ish crest-fallen dis-till-*ery fall-out lull-aby roll-away sell-out
wall-eye
lm: psalm-ist
ls: else-where false-hood
lt: con-sult-ant volt-age
lv: re-solv-able re-volv-er solv-able un-solv-able
mb: beach-comber bomb-er climb-er plumb-er
mp: damp-en damp-est
nch: clinch-er launch-er lunch-eon ranch-er trench-ant
nc: an-nouncer bouncer fencer hence-forth mince-meat si-lencer
nd: bind-ery bound-ary com-mend-*a-*t*ory de-pend-able ex-pend-able
fiend-ish land-owner out-land-ish round-about send-off stand-out
ng: change-over hang-out hang-over ha-rangue me-ringue orange-ade tongue
venge-ance
ns: sense-less
nt: ac-count-ant ant-acid ant-eater count-ess rep-re-sentative
nth: ant-hill pent-house per-cent-*age
pt: ac-cept-able ac-ceptor adapt-able adapt-er crypt-analysis in-ter-rupt-ible
qu: an-tiq-uity ineq-uity iniq-uity liq-uefy liq-uid liq-ui-date liq-ui-da-tion
liq-uor pre-req-ui-site req-ui-si-tion u-biq-ui-tous
rb: ab-sorb-ent carb-on herbal im-per-turb-able
rch: arch-ery arch-an-gel re-search-er un-search-able
rd: ac-cord-ance board-er chordal hard-en hard-est haz-ard-ous jeop-ard-ize
re-corder stand-ard-ize stew-ard-ess yard-age
rf: surf-er
rg: morgue
rl: curl-i-que
rm: af-firm-a-*t*i*ve con-form-*ity de-form-ity in-form-a*nt non-con-form-ist
rn: cav-ern-ous dis-cern-ible mod-ern-ize turn-about turn-over un-gov-ern-able
west-ern-ize
rp: harp-ist sharpen
rq: torque
rs: coars-en ir-re-vers-ible nurse-maid nurs-ery purser re-hears-al re-vers-ible
wors-en
rt: art-ist con-vert-ible court-yard fore-shorten heart-ache heart-ily short-en
rth: apart-heid court-house earth-en-ware north-east north-ern port-hole
rv: nerv-ous ob-serv-a*ble ob-server pre-serv-*a-*t*i*ve serv-er serv-ice-able
sch: pre-school
sc: con-de-scend cre-scendo de-cre-scendo de-scend-ent de-scent pleb-i-scite
re-scind sea-scape
sk: askance snake-skin whisk-er
sl: cole-slaw
sn: rattle-snake
ss: class-ify class-room cross-over dis-miss-al ex-press-ible im-pass-able
less-en pass-able toss-up un-class-i-fied
st: ar-mi-stice astig-ma-tism astir astonish-ment blast-off
by-stander candle-stick cast-away cast-off con-test-ant co-star
de-test-able di-gest-ible east-ern ex-ist-ence fore-stall
in-con-test-able in-di-ges*t-*i*ble in-ex-haust-ible life-style lime-stone
live-stock mile-stone non-ex-ist-ent per-sist-ent pho-to-stat
re-start-ed re-state-ment re-store shy-ster side-step
smoke-stack sug-gest-*i*ble thermo-stat waste-bas-ket waste-land
sth: mast-head post-hu-mous priest-hood
sw: side-swipe
tt: watt-meter
tw: be-tween
tz: kib-itzer
zz: buzz-er
Words which are included since they are common in my vocabulary and need more
hyphens than TEX would find:
al-go-rithm
bib-li-og-raphy
bi-no-mial
cen-ter
com-put-a-*bil-ity
dec-la-ra-tion
de-gree
es-tab-lish
hap-hazard
neg-li-gible
pe-ri-odic
poly-no-mial
pre-vious
prob-a-bil-ity
prob-able
pro-ce-dure
pub-li-ca-tion
pub-lish
re-place-ment
when-ever
To conclude this memo, I should explain how TEX is going to work on
math formulas. I can at least sketch this.
The main operators that need to be discussed are ↓, ↑, \over,
\groupxy, and \sqrt; others are reduced to minor variations on these
themes (e.g., \int and \sum are converted to something similar to ↓
and ↑, \atop is an unruled \over, \underline is like \group, and \vinc
(overline) is like \sqrt). Each math formula is first parsed into a tree,
actually a modified hlist which I shall call a tlist. A tlist is a list of
trees possibly separated by horizontal glue, and a tree is one of the
following:
a box (if not a character box then it was constructed with mathmode off);
the node \sub with a tree as left son and a tlist as right son;
the node \sup with a tree as left son and a tlist as right son;
the node \subsup with a tree as left son, tlists as middle and right sons;
the node \over with tlists as left and right sons;
the node \sqrt with tlist as son;
the node \group with bracket characters as left and middle sons and
with a tlist as right son.
Best results will be obtained when using a family of three fonts of varying
sizes. The definition
\fntfam <char><char><char>
defines such a family in decreasing order of size. For example, TEX will be
initially tuned to work with the following set of font definitions:
\deffnt a cm10 \deffnt g cmi10 \deffnt u cmath10
\deffnt b cm9 \deffnt h cmi9 \deffnt v cmath9
\deffnt c cm8 \deffnt i cmi8 \deffnt w cmath8
\deffnt d cm7 \deffnt j cmi7 \deffnt x cmath7
\deffnt e cm6 \deffnt k cmi6 \deffnt y cmath6
\deffnt f cm5 \deffnt l cmi5 \deffnt z cmath5
\fntfam adf \fntfam bef \fntfam gjl \fntfam hkl \fntfam uxz \fntfam vyz
\mathrm a \mathit g \mathsy u (in the text)
\mathrm b \mathit h \mathsy v (in the exercises)
These are 10 pt thru 5 pt fonts of "Computer Modern" and "Computer Modern Italic";
8 pt type actually doesn't get used in formulas, only at the bottom of title
pages and in the index.
Characters within math formulas will be adjusted to use the appropriate font
from a family if the current font appears as the first (largest) of some
declared family; otherwise the single font by itself will be treated as
a ``family'' of three identical fonts (i.e., using the same size in
subscripts as elsewhere).
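The family lookup just described can be sketched in Python as follows. The family strings are the \fntfam declarations listed above; the function names and the size index (0 for the largest, 2 for the smallest) are assumptions of the sketch.

```python
# The six families declared by the \fntfam commands above.
FAMILIES = ["adf", "bef", "gjl", "hkl", "uxz", "vyz"]

def family_of(font):
    """Return the three-font family headed by font, largest size first."""
    for fam in FAMILIES:
        if fam[0] == font:
            return fam
    return font * 3        # not the head of any family: use the same font thrice

def font_for_size(font, size):
    """size is 0, 1, or 2, in decreasing order of type size."""
    return family_of(font)[size]

print(font_for_size("g", 1), font_for_size("d", 2))
```

Font d (cm7) appears only as the second member of a family, so by the rule above it is treated as its own family and stays the same size at every level.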
After a math formula has been completely parsed into a tlist, TEX goes from
top to bottom assigning one of five modes to the individual trees:
A display mode
B text mode
C text mode with lower subscripts
D script mode
E scriptscript mode
Later on, modes ABC will use the size of the first of a font family, while
D and E will use the sizes of the second and third, respectively. The following
table shows how TEX determines the modes of the sons of a tree node, given the
mode of the father:
father  \sub  \sup  \subsup  \over  \sqrt  \group
A       AD    AD    ADD      BC     C      A
B       BD    BD    BDD      DD     C      B
C       CD    CD    CDD      DD     C      C
D       DE    DE    DEE      EE     D      D
E       EE    EE    EEE      EE     E      E
Large summation and integral signs, etc., are used only in mode A.
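The mode table lends itself to a direct lookup. The Python sketch below transcribes the table; the dictionary layout and function names are assumptions, and the single entry for \group is taken to apply to its tlist son.

```python
# Modes of the sons, left to right, indexed by the father's mode and the node.
SON_MODES = {
    "A": {"sub": "AD", "sup": "AD", "subsup": "ADD", "over": "BC", "sqrt": "C", "group": "A"},
    "B": {"sub": "BD", "sup": "BD", "subsup": "BDD", "over": "DD", "sqrt": "C", "group": "B"},
    "C": {"sub": "CD", "sup": "CD", "subsup": "CDD", "over": "DD", "sqrt": "C", "group": "C"},
    "D": {"sub": "DE", "sup": "DE", "subsup": "DEE", "over": "EE", "sqrt": "D", "group": "D"},
    "E": {"sub": "EE", "sup": "EE", "subsup": "EEE", "over": "EE", "sqrt": "E", "group": "E"},
}

# Which member of the font family each mode uses (0 = first, largest).
SIZE_INDEX = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 2}

def son_modes(father, node):
    """Return the modes of the sons of `node` when the father has mode `father`."""
    return tuple(SON_MODES[father][node])

def exponent_tower_modes(top_mode, depth):
    """Modes of x1, x2, ... in x1 sup {x2 sup {x3 ...}} with `depth` superscripts."""
    modes, mode = [], top_mode
    for _ in range(depth):
        base, script = SON_MODES[mode]["sup"]
        modes.append(base)
        mode = script
    modes.append(mode)
    return modes

print(son_modes("A", "over"), exponent_tower_modes("A", 2))
```

For a displayed formula with a doubly nested superscript, the three levels come out as modes A, D, and E, so they use the first, second, and third fonts of the family respectively.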
Once the modes are assigned, TEX goes through bottom up, converting all trees
to boxes by setting the glue everywhere except at the highest level tlist,
which becomes an hlist (passed to the paragraph-builder or whatever). Incidentally,
if you want to understand why TEX does a top-down pass and then a bottom-up
pass, note that for example the numerator of \over isn't known to be a
numerator at first; consider "1 \over 1", where the "1" is supposed to be
mode D. Furthermore the \subsup nodes can originate either from
...↓...↑... or from ...↑...↓...
since I found that some typists like to do subscripts first and others like
to do superscripts first. Incidentally, when TEX parses a formula, ↓ and
↑ have highest precedence, then \sqrt, then \over;
x ↓ y ↓ z and x ↑ y ↑ z
are treated as
x ↓{y ↓ z} and x ↑{y ↑ z},
while constructions such as
x ↓ y ↑ z ↓ w
are illegal.
The first font of a family should possess tables that tell TEX how much to
raise the baseline of superscripts, lower the baseline of subscripts, and
position the various baselines of the \over construct, as a function of the mode
and the node. For example, in the fonts I am designing, the 7-point superscript
of an unsubscripted 10-point box will have its baseline raised 11/3 pt in B mode,
26/9 pt in C mode; the subscript of an unsuperscripted box will have baseline
lowered 3/2 pt in both B and C modes; and when both sub- and super-scripts are
present the subscript will be lowered 11/4 pt and the superscript raised
11/3 pt or 26/9 pt (or more if necessary to appear above a complex subscript).
Subscripts and superscripts on more complex boxes (e.g. groups) are positioned
based on the lower and upper edges of the box.
Displayed formulas are never broken between lines by TEX; the user is supposed
to figure out the psychologically best place to break them. Since TEX has
negative glue components, it will be possible to squeeze longish formulas onto
a line. Multiple displayed formulas should be separated by the \cr's of
\halign or \eqalign.